

Manchester iGEM 2016

Probability Density Functions

We now have a collection of literature values for every parameter, together with their associated uncertainty. Next, we want to sample plausible values for use in our ensemble simulations. Of course, values could be picked at random from the literature data we found, but what about the ‘gaps’ between the data points? These intermediate values are perfectly plausible parameter values too.

A probability density function (PDF) describes our beliefs about the plausibility of the different possible parameter values in a systematic way, and can be used to sample continuously from the entire range of plausible values.
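Concretely, the probability that a parameter's value lies between two numbers a and b is the area under its PDF between them:

$$P(a \le x \le b) = \int_a^b{f(x)}\, dx$$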

What are the options?

There are multiple ways to generate a probability density function from a discrete dataset. They can be split into two categories: parametric and non-parametric. In parametric approaches, the data are used to find the parameters of a known distribution that best fits the data. We tried two parametric methods: fitting a log-normal distribution and fitting a normal distribution.
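As a minimal MATLAB-style sketch (not the team's actual script), the two parametric fits amount to estimating the distribution parameters from the data; here data is a placeholder for the vector of literature values of one parameter:

mu_norm    = mean(data);         % normal fit: mean of the dataset
sigma_norm = std(data);          % normal fit: standard deviation of the dataset
mu_ln      = mean(log(data));    % log-normal fit: mean of the log-transformed dataset
sigma_ln   = std(log(data));     % log-normal fit: standard deviation of the log-transformed dataset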

Non-parametric methods don’t use knowledge of a specific distribution; we tried one such method, kernel density estimation. This works by giving each data point an associated wave, the superposition of which gives our PDF.
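A minimal sketch of such a kernel density estimate with the Epanechnikov kernel (the kernel mentioned in the PDF equations section below); data, the bandwidth h and the evaluation grid xgrid are all placeholders:

K = @(u) 0.75 * (1 - u.^2) .* (abs(u) <= 1);     % Epanechnikov kernel: the 'wave' attached to each point
pdf_hat = zeros(size(xgrid));
for i = 1:numel(data)                            % superpose one wave per data point
    pdf_hat = pdf_hat + K((xgrid - data(i)) / h);
end
pdf_hat = pdf_hat / (numel(data) * h);           % scale so the estimate integrates to 1

The bandwidth h controls how much each wave is smeared out, and is the main tuning choice of the method.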



Example of a PDF we used

The graph below is the PDF for K_eq of GOx. The mean, median and mode are highlighted to demonstrate the strong skew of a log-normal distribution.

[Figure: PDF for K_eq of GOx with the mean, median and mode marked (https://static.igem.org/mediawiki/2016/3/37/T--Manchester--PDFmodemeanmedian.jpg)]
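For a log-normal distribution with the parameters $$\mu$$ and $$\sigma$$ defined in the next section, these three statistics are given by the standard results

$$\text{mode} = e^{\mu - \sigma^2}, \qquad \text{median} = e^{\mu}, \qquad \text{mean} = e^{\mu + \sigma^2/2},$$

so mode < median < mean whenever sigma is greater than zero, which is why the three markers on the graph do not coincide.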

PDF equations

For the parametric methods, the distribution parameters are calculated from our dataset and then substituted into the relevant equation. For the kernel method, the Epanechnikov kernel was used for the wave associated with each point. The kernel method was not used further during our project: following discussion and debate with our supervisors, the parametric log-normal distribution was used instead.
To calculate a log-normal distribution you must evaluate the following function:

$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} e^{\left[{\frac{-(\ln{x}-\mu)^2}{2\sigma^2}}\right]} $$

where:

Symbol        Meaning
$$\sigma$$    standard deviation of the natural log of the dataset
$$\mu$$       mean of the natural log of the dataset
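A minimal MATLAB-style sketch of evaluating this function, with data again a placeholder for the literature values of one parameter:

mu    = mean(log(data));      % mean of the log-transformed dataset
sigma = std(log(data));       % standard deviation of the log-transformed dataset
f     = @(x) 1 ./ (x * sigma * sqrt(2*pi)) .* exp(-(log(x) - mu).^2 ./ (2*sigma^2));

The function handle f can then be evaluated on the bin edges used in the sampling procedure below.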

Practical considerations

How we sampled from our distributions

The range of the PDF was split into bins. The PDF is evaluated at the start of each bin, and the cumulative integral up to each bin edge,

$$F(x_i) = \int_0^{x_i}{f(x)}\, dx ,$$

is found. These values are then all normalised so that the final value is equal to 1. As such, the difference between the cumulative integrals at two sequential bin edges is equal to the probability that a randomly picked parameter value lies in the bin defined by those two edges.
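In terms of the cumulative integral, the probability assigned to the bin running from edge $$x_i$$ to edge $$x_{i+1}$$ is simply

$$P(x_i \le x < x_{i+1}) = \frac{F(x_{i+1}) - F(x_i)}{F(x_{max})} ,$$

where $$x_{max}$$ denotes the final bin edge.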

A random number r between 0 and 1 is now generated. This is compared to the normalised cumulative integral to decide which bin to sample from.

[Figure: diagram illustrating the binned sampling procedure (https://static.igem.org/mediawiki/2016/1/1c/T--Manchester--PDFexplain.jpg)]



If r lies between the cumulative integral at a bin's start point A and the cumulative integral at its end point B, then the sampled value will be in that bin:

$$\int_0^A{f(x)}\, dx < r < \int_0^B{f(x)}\, dx$$

Finally, we know which bin to sample from, but not yet where in the bin to sample. The PDF is assumed to be constant over the bin, so a newly generated, separate random number between 0 and 1, u, decides where in the bin the sampled value lies, with every position equally likely:

$$ x_{sampled} = A + u(B-A)$$

Of course, the bins in the actual simulation are much smaller than those shown in the diagram, so the assumption that all values within a bin are equally likely holds well. Ideally the bin size would tend to zero, but shrinking it further adds computational cost for a negligible increase in precision.




Final Notes

1) The pseudocode for our sampling method is detailed below.



% Assumes f (a vectorised function handle for the PDF), xmin, xmax,
% BinNumber and Iterations are already defined.
x  = linspace(xmin, xmax, BinNumber + 1);   % bin edges
y  = f(x);                                  % evaluate the PDF at the start of every bin
yi = cumtrapz(x, y);                        % cumulative area under the curve up to each bin edge
yi = yi ./ max(yi);                         % normalise so the total area is 1; area now equals probability

bin_areas = zeros(1, BinNumber);
for i = 1:BinNumber                         % probability of each bin is the difference of the
    bin_areas(i) = yi(i+1) - yi(i);         % normalised cumulative integral at its two edges
end

samples = zeros(1, Iterations);
for j = 1:Iterations                        % loop over the number of points you want to sample
    r = rand();                             % uniform random number between 0 and 1
    k = 1;
    semisum = bin_areas(1);
    while semisum < r                       % walk up the cumulative integral, bin by bin,
        k = k + 1;                          % until it first exceeds r: that is the selected bin
        semisum = semisum + bin_areas(k);
    end
    u = rand();                             % second random number: position within the bin
    samples(j) = x(k) + u * (x(k+1) - x(k)); % x_sampled = A + u*(B - A)
end




2) A simpler alternative to PDFs, which requires a large, high-density dataset to be valid, is to use only the data points you have: make each gap between sequential data points a bin, then pick bins for sampling with probability proportional to the local density of data points, i.e. a constant divided by the bin width. These bin probabilities can then be normalised and the sampling finished as explained above.
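A minimal MATLAB-style sketch of this alternative, where data (the vector of literature values, assumed to have no repeated points) and Iterations are placeholders:

xs = sort(data(:))';                  % sorted data points become the bin edges
w  = 1 ./ diff(xs);                   % weight of each gap: a constant divided by the bin width
w  = w ./ sum(w);                     % normalise the weights into bin probabilities
cw = cumsum(w);                       % cumulative bin probabilities

samples = zeros(1, Iterations);
for j = 1:Iterations
    r = rand();
    k = find(cw >= r, 1);             % first bin whose cumulative probability reaches r
    u = rand();                       % uniform position within the chosen bin
    samples(j) = xs(k) + u * (xs(k+1) - xs(k));
end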


Return to overview: https://2016.igem.org/Team:Manchester/Model
Continue to How we simulated our system: https://2016.igem.org/Team:Manchester/Model/Simulate