Probability Density Functions
We now have a collection of literature values for all parameters and their associated uncertainty. Now we want to sample plausible values for use in our ensemble simulations. Of course, values could be randomly picked from the found literature data, but what about the ‘gaps’ between the data points? These intermediate values are perfectly plausible parameter values too. The Probability Density Functions (PDF) describe our beliefs about the plausibility of different possible parameter values in a systematic way, and can be used for sampling continuously from the entire range of plausible values.
What are the options?
There are multiple ways to generate a probability density function from a discrete data set. They can be split into two categories; parametric and non-parametric. In parametric approaches, the data are used to find parameters for a known distribution that fits the data the best. We tried two parametric methods: fitting to a log-normal distribution and fitting to a normal distribution.
Non-parametric methods don’t use knowledge of a specific distribution; we tried one such method, kernel density estimation. This works by giving each data point an associated wave, the superposition of which gives our PDF.
Example of a PDF we used
The below graph is the PDF for Keq GOx. The mean, median and mode are highlighted to demonstrate the high skew of a log normal distribution.
PDF equations
For the parametric methods the distribution parameters are calculated for our data set, these are then put into the relevant equation. For the kernel method, the Epanechnikov kernel was used for the waves associated with each point. This was not used further during our project as the non-parametric log normal distribution was used instead following discussion and debate with our supervisors. To calculate a log normal distribution you must evaluate the following function:
$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} e^{\left[{\frac{-(\ln{x}-\mu)^2}{2\sigma^2}}\right]} $$where:
Symbol | Meaning |
---|---|
$$\sigma$$ | Standard Deviation of the dataset |
$$\mu$$ | Mean of the dataset |
Practical considerations
How we sampled from our distributions
The PDF was split up into bins. The PDF is evaluated at the start position of each of these bins and the cumulative integral:
$$\int_0^n{f(x)} dx$$
Was found at the end of each bin. These values are then all normalised so that the final value is equal to 1. As such the difference between two sequential bins cumulative integral is equal to the probability that a randomly picked parameter will be in the bin defined by the two start points.
A random number between 0 and 1 is now generated, This is compared to the cumulative integral to decide which bin to sample from.
Finally we know what bin to sample from but we dont yet know where in the bin too sample. Its assumed that the PDF is constant over the bin such that a newly generated separate random number between 0 and 1, r, is used to decide where in the bin the value will, assuming all values are equally likely. sampled
If a random number,r, was chosen such that Its value is between the cumulative integral at B and the cumulative integral between A then the value will be in that bin.
$$\int_0^A{f(x)} dx < r < \int_0^B{f(x)} dx$$ Of course the bins in the diagram are much smaller in the actual simulation so the assumption all values in a bin are equally likely is valid. Ideally bin size tends to zero however this adds computational cost for a negligible increase in precision.Final Notes
1) The psuedo code for our sampling method is detailed below.
y = evaluation of PDF at start of all bins.; yi = cumtrapz(y); "yi is a variable that stores all the cumalitive areas under the graph at the start of each bin" yi = yi./max(yi); "normalise so that the total area under curve is one so now area is equal to probability" for i = 1:BinNumber %%loop through all the Bins
bin_areas(i) = yi(i+1) - yi(i); "evaluate the probability of the bins containing a given random data point by evaluating the normalised cumulative integral at the start and end point of a bin. " end for j = 1:Iterations "loop for the number of points you want to sample"
r = rand(); "generate a random number uniformly distributed between 0 and 1."
while "keep doing until found sampled data points bin"
if semisum > r "if greater than random number you have selected the correct bin. store this j will now increment"
else
semisum = semisum + bin_areas( next one); "keep going through bins moving back up the cumalitive integral in bin steps"
end
end
end
end
2) A simpler alternative to PDFs that requires a large and high density data set to be valid is to use only the data points you have, making each sequential gap between data points a bin. Next pick bins for sampling proportional to the density of data points so a constant divided by width of bin. This can then be normalised and finished like explained in how we sampled our distributions.
Return to overview Continue to How we simulated our system