PDF (Probability Density Function)
What is a PDF?
First, a piece of jargon: a PDF, or probability density function, is a function that describes a distribution, e.g. the well-known bell curve of a normal distribution. If you select a range of possible values, e.g. x between 2 and 3, the area under the curve between 2 and 3 divided by the total area under the curve is equal to the probability that the value lies in this range.
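As a rough illustration of the area idea, here is a minimal Python sketch for a normal distribution. The mean of 2.5 and standard deviation of 1 are made-up values chosen only for the example, and the function names are ours rather than anything from the project code. For a properly normalised PDF the total area is 1, so the division by the total area drops out.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Area under the normal bell curve from minus infinity up to x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_between(a, b, mu=0.0, sigma=1.0):
    """Probability that a normally distributed value lies between a and b."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

# Hypothetical distribution: mean 2.5, standard deviation 1.0
p = prob_between(2.0, 3.0, mu=2.5, sigma=1.0)
print(f"P(2 < x < 3) = {p:.3f}")  # roughly 0.383
```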
Why we need PDFs
We now have a data set for the parameters, and we want to pick values from it in a smart way for running in the model. This is done by turning the data into a PDF.
Values could be picked at random from the parameters found, but what about the ‘gaps’ between the points? These could be perfectly valid potential values too.
Converting a discrete data set into a continuous one allows for better precision and makes sure you're not missing any interesting parameter range. This continuous distribution is our PDF.
What are the options?
There are multiple ways to create a PDF from a data set. They can be split into two categories: parametric and non-parametric. Parametric means the data is used to find the parameters of a known distribution that best fits the data. We tried two parametric methods: fitting to a log-normal distribution and fitting to a normal distribution.
Non-parametric methods don’t assume a known distribution. We tried one such method, “kernels”, which works by giving each data point an associated wave, the superposition of which gives your PDF.
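To make the superposition idea concrete, here is a minimal sketch of a kernel density estimate in plain Python. It assumes the Epanechnikov kernel as the "wave"; the data values, the bandwidth h = 50 and the function names are made up for illustration and are not taken from the project code.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) inside [-1, 1], zero outside."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kernel_pdf(x, data, h):
    """Sum one kernel per data point, scaled so the total area is 1."""
    return sum(epanechnikov((x - d) / h) for d in data) / (len(data) * h)

# Hypothetical data set and bandwidth, for illustration only
data = [12.0, 30.0, 45.0, 80.0, 500.0, 700.0]
h = 50.0

# Check numerically that the estimate integrates to about 1
step = 0.5
grid = [-100.0 + i * step for i in range(2001)]  # covers -100 to 900
area = sum(kernel_pdf(x, data, h) for x in grid) * step
print(f"area under the kernel PDF = {area:.4f}")
```

Each data point contributes a bump of width 2h, so isolated points such as 500 and 700 keep their own peaks instead of being absorbed into a single fitted curve.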
All these methods have advantages and drawbacks, which are discussed below.
Graphs to aid understanding
These graphs show the difference between the raw data and data sampled from a PDF created from it.
The graphs display samples from a PDF created using the log-normal method. As you can see from the histograms, sampling has given much more information about the densely populated 0-100 Km value range; however, it has unfairly reduced the number of values around Km ≈ 500 and Km ≈ 700. This is a potential downside of parametric methods and can be remedied by the kernel method. However, looking at the values in ascending order, these Km values are still present in the sample of 500.
How it’s done - Finding PDF
First, the distribution function must be calculated. For the parametric methods, the mean and standard deviation are calculated from the data set and then put into the relevant equation. For the kernel method, the Epanechnikov kernel was used for the waves associated with each point.
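A minimal sketch of the two parametric fits might look like the following. It is an illustration under assumptions, not the project code: for the log-normal case it assumes the mean and standard deviation are taken of the logarithms of the data (a common convention), it uses the sample standard deviation, and the data values and function names are invented.

```python
import math
import statistics

def fit_normal(data):
    """Return a normal PDF using the mean and standard deviation of the data."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    def pdf(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf

def fit_lognormal(data):
    """Return a log-normal PDF; assumes mean and deviation of log(data)."""
    logs = [math.log(v) for v in data]  # requires strictly positive data
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)
    def pdf(x):
        if x <= 0.0:
            return 0.0  # a log-normal has no density at or below zero
        z = (math.log(x) - mu) / sigma
        return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))
    return pdf

# Hypothetical data set, for illustration only
data = [12.0, 30.0, 45.0, 80.0, 500.0, 700.0]
normal_pdf = fit_normal(data)
lognormal_pdf = fit_lognormal(data)
print(normal_pdf(100.0), lognormal_pdf(100.0))
```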
Practical considerations
How it’s done - implementation
Bins are created (the more the better, as long as you have enough data to fill them with good statistics). The function is evaluated at the start position of each bin and the cumulative integral of the function is found, giving an array that holds the value of the integral at each bin start point. These values are then normalised so that the final value is equal to 1. As a result, the difference between two sequential values is equal to the probability that a value lies in that bin.
A random number between 0 and 1 is now generated. It is used to select which bin to sample from, using the cumulative values from the previous paragraph.
Finally, a second, independently generated random number between 0 and 1 is used to decide where in the bin the value will be. To illustrate: if a bin's start point is 3 and its width is 2, a random number of 0.5 gives a value of 4.
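The three steps above can be sketched in a few lines of Python. This is a hedged illustration rather than the project's actual code: the function names, the exponential test PDF, the range 0 to 10 and the 200 bins are all assumptions made for the example, and the integral is approximated by the PDF at each bin start times the bin width.

```python
import bisect
import math
import random

def build_cumulative(pdf, start, stop, n_bins):
    """Evaluate pdf at each bin start and return the normalised cumulative sum."""
    width = (stop - start) / n_bins
    edges = [start + i * width for i in range(n_bins)]
    cumulative = []
    total = 0.0
    for edge in edges:
        total += pdf(edge) * width   # approximate area of this bin
        cumulative.append(total)
    cumulative = [c / total for c in cumulative]  # final value becomes 1
    return edges, width, cumulative

def sample(edges, width, cumulative, rng=random):
    """Pick a bin with the first random number, a position with the second."""
    i = bisect.bisect_right(cumulative, rng.random())
    i = min(i, len(edges) - 1)       # guard against rounding at the top end
    return edges[i] + rng.random() * width

# Hypothetical PDF and range, for illustration only
rng = random.Random(0)
edges, width, cumulative = build_cumulative(lambda x: math.exp(-x), 0.0, 10.0, 200)
samples = [sample(edges, width, cumulative, rng) for _ in range(500)]
```

Using bisect on the cumulative array keeps the bin lookup fast even with many bins, and bins with zero probability are never selected because their cumulative value does not increase.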
Noticed fail cases: assessment of methods
As was mentioned briefly in the theory section, each method has some drawbacks. In the following it is important to note that enzyme rate constants are distributed log-normally. What is really meant by this is that the rate constant has one true value for our experiment, but the variation between experiments is log-normal. We tested these methods using self-iterating code that updated guessed constants by fitting to data (an explanation of this code won’t be given in the text, but the code, "Emporer", is available in our GitHub). The data was manufactured from specific parameters that the code did not know about, so the code should have evolved towards the manufactured parameters. Note that we had to take into account sloppy parameters etc.; see the parameter analysis section.
The Gaussian method was generally very good in that it would always evolve towards the manufactured constants initially; however, once close it would never quite reach them, settling into equilibrium a good distance away. It also overshoots if coming from below zero with a high deviation in the initial parameter guesses.
The log-normal method generally gets very close (within a couple of percent) to the actual parameters but can often overshoot.
The kernel method provides even better fits but can go completely crazy if the data set is too spread out. It does take multi-peaked distributions into account better, especially with a large h (h is the kernel bandwidth; look into kernel theory to find out more about what that means).
We also experimented with modified PDF generators that used, e.g., an increased variance; some of these had success.
With this knowledge you can choose a good PDF generator, or perhaps a mix of PDFs, for any system of equations.
Suggested procedure
The validity of these generators will depend on what sort of curves your constants generate. Therefore, in general, you must test them using code like “Emporer” (its name in our GitHub) and find these validity conditions out for yourself.
Density bins
A simpler alternative, which requires a large, high-density data set to be valid, is to use only the data points you have, making each sequential gap between data points a bin. Next, pick bins for sampling in proportion to the density of data points, i.e. a constant divided by the width of the bin. This can then be normalised and finished as explained in "How it's done - implementation" above.
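A minimal sketch of the density-bins idea, following the description literally: each gap between neighbouring data points is a bin, and each bin is weighted by 1 / width before normalising. The function names and data are invented for the example, and dropping duplicate values (to avoid zero-width bins) is our assumption. Whether 1 / width is meant as the selection weight, as here, or as the density height inside the bin (which would give every gap equal weight) is also an assumption; only the `weights` line would change.

```python
import bisect
import random

def density_bins(data):
    """Build bins from the gaps between sorted data points, weighted by 1/width."""
    points = sorted(set(data))        # assumption: drop duplicate values
    starts = points[:-1]
    widths = [b - a for a, b in zip(points, points[1:])]
    weights = [1.0 / w for w in widths]   # constant divided by bin width
    cumulative = []
    running = 0.0
    for w in weights:
        running += w
        cumulative.append(running)
    cumulative = [c / running for c in cumulative]  # final value becomes 1
    return starts, widths, cumulative

def sample_density_bins(starts, widths, cumulative, rng=random):
    """Pick a gap with one random number, then a position inside it with another."""
    i = min(bisect.bisect_right(cumulative, rng.random()), len(starts) - 1)
    return starts[i] + rng.random() * widths[i]

# Hypothetical data set, for illustration only
rng = random.Random(0)
starts, widths, cumulative = density_bins([1.0, 2.0, 4.0, 8.0])
values = [sample_density_bins(starts, widths, cumulative, rng) for _ in range(100)]
```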