
Manchester iGEM 2016

Probability Density Functions

We now have a collection of literature values for all parameters and their associated uncertainty. Now we want to sample plausible values for use in our ensemble models. Of course, values could be picked at random from the literature data we found, but what about the ‘gaps’ between the data points? These intermediate values could be perfectly plausible parameter values too.

A Probability Density Function (PDF) describes our beliefs about the plausibility of different possible parameter values in a systematic way, and can be used to sample continuously from the entire range of plausible values.

What are the options?

There are multiple ways to estimate a probability density function from a data set. They can be split into two categories: parametric and non-parametric. In parametric approaches, the data are used to find the parameters of a known distribution that best fits the data. We tried two parametric methods: fitting to a log-normal distribution and fitting to a normal distribution.
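As an illustration, fitting either distribution amounts to estimating a mean and a standard deviation. A minimal Python sketch, with a hypothetical Km data set standing in for our literature values (our actual implementation may have differed):

```python
# A minimal sketch of the two parametric fits; the Km values are hypothetical.
import numpy as np

km_values = np.array([12.0, 35.0, 60.0, 80.0, 95.0, 480.0, 710.0])

# Normal fit: mu and sigma are the sample mean and standard deviation.
mu_norm, sigma_norm = km_values.mean(), km_values.std(ddof=1)

# Log-normal fit: mu and sigma are estimated from the natural log of the data.
logs = np.log(km_values)
mu_log, sigma_log = logs.mean(), logs.std(ddof=1)

print(f"Normal:     mu = {mu_norm:.1f}, sigma = {sigma_norm:.1f}")
print(f"Log-normal: mu = {mu_log:.2f}, sigma = {sigma_log:.2f}")
```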

Non-parametric methods don’t use knowledge of a specific distribution; we tried one such method, kernel density estimation. This works by giving each data point an associated wave (a kernel); the superposition of these waves gives our PDF.
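A minimal Python sketch of this idea, using the Epanechnikov kernel described later on this page; the data set and the bandwidth h are illustrative choices, not our actual settings:

```python
# Kernel density estimation: one scaled 'wave' per data point, superposed.
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) for |u| <= 1, else 0."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def kde(x, data, h):
    """Evaluate the KDE on grid x: average of kernels centred on each point."""
    u = (x[:, None] - data[None, :]) / h
    return epanechnikov(u).sum(axis=1) / (len(data) * h)

km_values = np.array([12.0, 35.0, 60.0, 80.0, 95.0, 480.0, 710.0])
x = np.linspace(0.0, 800.0, 1000)
density = kde(x, km_values, h=50.0)  # PDF estimate on the grid
```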



Example of a PDF we used

[Figures: graph 1 – graph 4, histograms and ascending-order plots comparing the raw literature data with samples drawn from the fitted PDF]

These graphs show the difference between the raw data and data sampled from a created PDF.

The graphs display samples from a PDF fitted using the log-normal method. As the histograms show, this gives much more information about the densely populated 0–100 Km value range; however, it has unfairly reduced the number of values around Km ≈ 500 and Km ≈ 700. This is a potential downside of parametric methods and can be remedied by the kernel method. Looking at the ascending-order values, however, these Km values are still present in the sample of 500.

PDF equations

For the parametric methods, the distribution parameters are calculated from our data set and then substituted into the relevant equation. For the kernel method, the Epanechnikov kernel was used for the waves associated with each point.

For example, to find the log-normal distribution you must evaluate the following function:

$$y = \frac{1}{x\sigma\sqrt{2\pi}} e^{\left[{\frac{-(\ln{x}-\mu)^2}{2\sigma^2}}\right]} $$

where:

Symbol | Meaning
$$\sigma$$ | Standard deviation of the natural logarithm of the data
$$\mu$$ | Mean of the natural logarithm of the data
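The equation can be transcribed directly into code. A minimal Python sketch; the parameter values passed in at the end are illustrative only:

```python
# Direct transcription of the log-normal PDF above.
import numpy as np

def lognormal_pdf(x, mu, sigma):
    """y = 1 / (x * sigma * sqrt(2*pi)) * exp(-(ln x - mu)^2 / (2 * sigma^2))."""
    return (1.0 / (x * sigma * np.sqrt(2.0 * np.pi))
            * np.exp(-(np.log(x) - mu)**2 / (2.0 * sigma**2)))

y = lognormal_pdf(np.linspace(1.0, 800.0, 1000), mu=4.0, sigma=1.2)
```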

Practical considerations

How we sampled from our distributions

The PDF was split up into bins. The PDF is evaluated at the start position of each bin, and the cumulative integral (total area under the PDF from parameter = 0) is found at the end of each bin. These values are then all normalised so that the final value is equal to 1. As such, the difference between the cumulative integrals at two sequential bin edges is equal to the probability that a randomly picked parameter will lie in the bin defined by those two points.
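A minimal Python sketch of this step; the PDF (here scipy.stats.lognorm), the bin grid, and the parameter values are illustrative assumptions:

```python
# Build the normalised cumulative integral over a grid of bins.
import numpy as np
from scipy.stats import lognorm

edges = np.linspace(0.0, 800.0, 10001)              # bin start points
pdf = lognorm.pdf(edges, s=1.2, scale=np.exp(4.0))  # illustrative mu=4, sigma=1.2

widths = np.diff(edges)
bin_areas = pdf[:-1] * widths                       # PDF assumed constant over each bin
cumulative = np.concatenate([[0.0], np.cumsum(bin_areas)])
cumulative /= cumulative[-1]                        # normalise so the final value is 1
```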

A random number between 0 and 1 is now generated; this is compared to the cumulative integral to decide which bin to sample from (see the figure below).

Finally, we know which bin to sample from, but we don’t yet know where in the bin to sample. The PDF is assumed to be constant over the bin, so a separate, newly generated random number between 0 and 1 decides where in the bin the value falls, with all positions equally likely: sampled value = value at start of bin + bin width × random number (0–1). The figure below is an example of a PDF that has been split into bins. The cumulative integral at any point on the x axis is between zero and one, and increases from left to right. If a random number x is chosen such that its value lies between the cumulative integral at A and the cumulative integral at B, then the sampled value will be in that bin.
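Putting the two steps together, a minimal, self-contained Python sketch of the sampling procedure (illustrative PDF and parameter values as before):

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng()

# Recreate the normalised cumulative integral from the previous sketch.
edges = np.linspace(0.0, 800.0, 10001)
pdf = lognorm.pdf(edges, s=1.2, scale=np.exp(4.0))
widths = np.diff(edges)
cumulative = np.concatenate([[0.0], np.cumsum(pdf[:-1] * widths)])
cumulative /= cumulative[-1]

# Step 1: a random number between 0 and 1 picks the bin.
r = rng.random()
i = max(np.searchsorted(cumulative, r) - 1, 0)

# Step 2: a second random number places the value uniformly within the bin.
sample = edges[i] + widths[i] * rng.random()
```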

Of course, the bins in the actual simulation are much smaller than those in the diagram, so the assumption that all values within a bin are equally likely is valid.

[Figure: an example PDF split into bins (T--Manchester--PDFexplain.jpg); the cumulative integral rises from zero to one across the x axis, with A and B marking the edges of one bin]


Final Note





A simpler alternative, which requires a large, high-density data set to be valid, is to use only the data points you have, making each gap between sequential data points a bin. Bins are then picked for sampling with probability proportional to the density of data points, i.e. a constant divided by the bin width. These probabilities can then be normalised and the sampling finished as explained in ‘How we sampled from our distributions’.
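A minimal Python sketch of this alternative, with a hypothetical data set; the bin-picking rule follows the description above:

```python
# Data-only sampling: each gap between sorted data points is a bin,
# picked with probability proportional to 1 / bin width, then sampled uniformly.
import numpy as np

rng = np.random.default_rng()
data = np.sort(np.array([12.0, 35.0, 60.0, 80.0, 95.0, 480.0, 710.0]))

widths = np.diff(data)                      # each gap is a bin
probs = 1.0 / widths                        # constant divided by bin width
probs /= probs.sum()                        # normalise
cum = np.concatenate([[0.0], np.cumsum(probs)])

r = rng.random()
i = max(np.searchsorted(cum, r) - 1, 0)
sample = data[i] + widths[i] * rng.random()
```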


Return to overview