Difference between revisions of "Team:Wageningen UR/Software"

 
(26 intermediate revisions by the same user not shown)
Line 5: Line 5:
 
<h4><a href="#header">Toxin Scanner</a></h4>
 
<h4><a href="#header">Toxin Scanner</a></h4>
 
<li class="menu-item">
 
<li class="menu-item">
<a href="#TS_intro">Introduction</a>
+
<a href="#TS_intro">Discovery & Specificity</a>
 +
</li>
 +
<li class="menu-item">
 +
<a href="#TS_validation">Testing the Tool</a>
 +
</li> 
 
<li class="menu-item">
 
<li class="menu-item">
<a href="#TS_methods">Methods</a>
+
<a href="#TS_methods">Software Description</a>
 
</li>   
 
</li>   
 
<li class="menu-item">
 
<li class="menu-item">
<a href="#TS_results">Results</a>
+
<a href="#TS_results"><i>Varroa</i> Isolate</a>
 
</li>   
 
</li>   
 
<li class="menu-item">
 
<li class="menu-item">
<a href="#TS_discussion">Discussion</a>
+
<a href="#references">References</a>
 
</li>  
 
</li>  
 
</div>
 
</div>
Line 20: Line 24:
 
<html>
 
<html>
 
<section id="TS_intro">
 
<section id="TS_intro">
<h1>Toxin Scanner</h1>
+
<h1 style="text-align:center;">Toxin Scanner</h1>
<h3>iGEM and sequencing</h3>     
+
<h3>BioBrick discovery</h3>     
<p>For iGEM 2016 we designed a high-throughput pipeline for the identification of novel proteins directly from raw genome sequencing data in fasta or fastq format. Given the specificity of our tool and the importance of biobrick discovery in iGEM, we made this tool publicly available for everyone to modify and use. It can be found in this <a href=https://gitlab.com/rphdejongh/bioinformatics.git> Gitlab repository </a>.  
+
<p>For iGEM 2016 we designed a high-throughput pipeline for the identification of novel proteins directly from raw genome sequencing data. Given the specificity of our tool and the importance of biobrick discovery in iGEM, we made it publicly available for everyone to modify and use. The tool can be found in this <a href=https://gitlab.com/rphdejongh/bioinformatics.git style="padding-right:0px;"> Gitlab repository </a>.  
 +
<br/>
 +
For the purpose of the BeeT project, we use it as a cry toxin predictor, given genomes of selected bacteria. </p>
 +
<h3>Toxin Specificity</h3>
 +
<p>For this project we need a toxin that specifically targets <i>Varroa destructor</i>. The most well known miticidal proteins are the crystal (Cry) proteins. These are usually found on megaplasmids from <i>Bacillus Thuringiensis</i> and related species. <sup><a href="#ts1" id="ref_ts1">1</a></sup> </p>
 +
<p>We know <i>Varroa</i>-specific miticidal activity exists in <i>Bacillus thuringiensis</i> and related species, as shown in: "In vitro susceptibility of <i>Varroa destructor </i>and <i>Apis mellifera</i> to native strains of <i>Bacillus thuringiensis</i>."  by Alquisira-Ramírez et al. <sup><a href="#ts2" id="ref_ts2">2</a></sup> In this paper, several isolates are described that cause a mite mortality of up to 100%. Importantly, the strains also showed no miticidal activity against bee larvae. Because of this we started several sub-projects in parallel to maximize our chances of finding a viable <i>V. destructor</i>-killer. The <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity>specificity</a>
 +
part of our project focuses on  <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#ToxinEngineering> creating </a> <i>V. destructor</i>-gut binding Cry toxins and <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#Isolates2> finding </a> <i>V. destructor</i>-specific miticidal proteins. </p>
 +
<p>There already exists a publication about a tool called “Bt Toxin Scanner”<sup><a href="#ts3" id="ref_ts3">3</a></sup>. This tool does not fully support local deployment, which is needed for high-throughput analysis. Also, because of the relatively basic analysis done by the tool, we decided to develop our own tool that is fully open-source and improves upon the analysis techniques used in Bt Toxin Scanner. Our goal with this tool is to run raw sequencing files, and deliver potential Cry proteins with just the click of a button.</p>
 +
<p>This tool was made in preparation for results of the latter, finding <i>V. destructor</i>-specific miticidal proteins, which we assume to be of the Cry protein family.
 +
These Cry proteins are a diverse group, but are known to be highly specific for individual insects, acari, nematodes and various other eukaryotic taxa. Cry proteins are not necessarily a group of proteins that all perform the same function in the same manner. The distinction between Cry and non-Cry proteins is defined by a committee: <a href=http://www.lifesci.sussex.ac.uk/home/Neil_Crickmore/Bt/>Cry Protein website</a> Based on 45% sequence similarity there are over 70 groups. This high amount of diversity makes it hard to predict when something is or isn't a Cry protein. Despite this diversity, many of them have the same three domain structure. The N-terminal domain I is involved in membrane insertion and pore formation, while domains II and III are involved in receptor recognition and binding to them.</p>
 +
</section>
  
For the purpose of the BeeT project, we use it as a cry toxin (oid) predictor, given genomes of selected bacteria. With some modifications it has the potential to be a more broadly applicable tool for finding specific genes in newly sequenced genomes. </p>
+
<section id="TS_validation">
<h3>Background</h3>
+
<h1>Testing the tool</h1>
<p>For this project we need a toxin specific to <i>Varroa destructor</i>. There are many insecticidal protein families, but the one we were most interested in are the "crystal" (or "cry") proteins. These are usually found on plasmids found in <i>Bacillus Thuringiensis</i> and related species. <sup><a href="#ts14" id="ref_ts14">14</a></sup>  
+
<p> We tested the tool on a genome sample from a study with accession number <a href=http://www.ebi.ac.uk/ena/data/view/PRJEB5931> PRJEB5931</a>. <sup><a href="#ts4" id="ref_ts4">4</a></sup> This genome was found after a co-evolution experiment, and a <i>Bacillus thuringiensis</i> with known nematicidal Cry proteins present.  
 +
<br/>
 +
<a href=http://www.ebi.ac.uk/ena/data/view/ERX463573>ERX463573</a> is the accession code of the experiment from which these raw read files came. The experiment this came from was about a Population of <i>Bacillus thuringiensis</i> which were coevolved with <i>Caenorhabditis</i>, which is a kind of nematode, as host.
 +
 
 +
From the study we know to expect at the very least the following two proteins:
 +
<a href=https://www.ncbi.nlm.nih.gov/nucleotide/47500285> Cry35Aa4 </a>
 +
and
 +
<a href=https://www.ncbi.nlm.nih.gov/nucleotide/2724454>Cry21Aa2</a>. <br/>
 +
According to the Cry protein toxin list it is known that Cry35Aa4 is a binary toxin with <a href=https://www.ncbi.nlm.nih.gov/nucleotide/47500295> Cry34Aa4 </a>, and as such we may expect to find this protein, or one like it, as well.
  
  
We know <i>Varroa</i>-specific insecticidal activity exists in <i>Bacillus thuringiensis</i> and related species, due to a paper called: "In vitro susceptibility of <i>Varroa destructor </i>and <i>Apis mellifera</i> to native strains of <i>Bacillus thuringiensis</i>."  by Alquisira-Ramírez et al. <sup><a href="#ts2" id="ref_ts2">2</a></sup> In this paper several isolates are described that cause a mite mortality of up to 100%. Importantly the strains also showed no insecticidal activity against bee larvae. Because of this we started several sub-projects in parallel to maximize our chances of finding a viable <i>V. destructor</i>-killer. The <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity>specificity</a>
+
</p><h2>Visualizing the result</h2><p>
part of our project focuses on  <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#ToxinEngineering> creating </a> <i>V. destructor</i>-binding cry toxins and <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#Isolates2> finding </a> <i>V. destructor</i>-specific insecticidal proteins.
+
The easiest way to visualize the results from the three separate components of the tool is to use a <a href=http://bioinformatics.psb.ugent.be/webtools/Venn/>Venn diagram</a>.
<br>
+
This tool was made in preparation for results of the latter, finding <i>V. destructor</i>-specific insecticidal proteins, which we assume to be of the cry protein family.
+
These cry proteins are a diverse group, but are known to be highly specific for individual insects and nematodes. Cry proteins are not necessarily a group of proteins that all perform the same function in the same manner. The distinction between cry and non-cry proteins is defined by a committee: <a href=http://www.lifesci.sussex.ac.uk/home/Neil_Crickmore/Bt/>Cry Protein website</a> Based on 45% sequence similarity there are over 70 groups, this high amount of diversity makes it hard to predict when something is or isn't a cry protein.  
+
  
<h3>Current developments</h3>
+
<figure><img src="https://static.igem.org/mediawiki/2016/2/2c/T--Wageningen_UR--Venn-Diagram_TS.png"/><figcaption>Figure 1: A Venn diagram showing the overlap of the output of the various methods used to predict whether or not certain genes from the genome are Cry proteins or not.</figcaption></figure>
<p>There already exists a publication about a tool called “Bt Toxin Scanner”<sup><a href="#ts1" id="ref_ts1">1</a></sup>. This tool does not fully support local deployment, which is needed for high-throughput analysis, and the relatively basic analysis done by the tool. Due to these reasons we decided to develop our own tool that is fully open-source and improves upon the Bt Toxin Scanner. Our goal with this tool is to run raw sequencing files, and deliver potential cry proteins with just the click of a button.</p>
+
</section>
+
  
 +
<p>As Blast had only 5 results it is easier to examine this method in detail. Two of the five were indeed proteins we expected from the paper: Cry35Aa4 and Cry21Aa2. Both of these were also picked up by the Hidden Markov Model method of cry protein detection as shown in figure 1, but not by the RandomForest method. <br/>
 +
Next we found a protein that matched very well with the Cry34Aa group, which is a complimentary protein to Cry35Aa4.<br/>
 +
The two other proteins which were found: Cry38Aa1, which has no known insecticidal target, but is highly similar to proteins that do: Cry15Aa1, Cry23Aa1, and Cry33Aa1.  <sup><a href="#ts5" id="ref_ts5">5</a></sup><br/>
 +
and "Gene_5518" which was 85.87% identical to: Cry14Ab1. This protein is only mentioned in a patent by Sampson et al. 2012.<sup><a href="#ts6" id="ref_ts6">6</a></sup> This protein is quite interesting because it might be a completely new protein or a variation of Cry14Ab1 specific to nematodes.
 +
 +
</section>
 
<section id="TS_methods">
 
<section id="TS_methods">
<h1>Methods</h1>
+
<h1>Tool overview</h1>
<h3>Overview</h3><p>
+
<p>
We use a combination of existing tools to come to the prediction of novel cry proteins. The entire pipeline consists of four scripts in total, one of which is entirely dedicated to analysis of the Random Forest model and not further used in the main program. The others are there to group the Machine Learning specific functions, the functions that handle known cry proteins, and the functions that handle raw sequence data. Figure 1 gives a graphical representation of the pipeline, though some modules have been left out for the sake of readabilit. Here, we shall go through each part of the process in a step-by-step manner.
+
We use a combination of existing tools to come to the prediction of novel cry proteins. The entire pipeline consists of four scripts in total, one of which is entirely dedicated to analysis of the Random Forest model and not further used in the main program. The others are there to group the Machine Learning specific functions, the functions that handle known cry proteins, and the functions that handle raw sequence data. Figure 2 gives a graphical representation of the pipeline, though some modules have been left out for the sake of readability. Here, we go through each part of the process in a step-by-step manner.</p>
 +
<map name="pipeline-map" id="pipeline-map">
 +
    <area alt="idba_ud" title="" href="http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/" shape="poly" coords="663,10,744,89,663,165,584,88" />
 +
    <area alt="genemark" title="" href="http://exon.gatech.edu/GeneMark/" shape="poly" coords="664,247,742,324,663,402,587,325" />
 +
    <area alt="Cry protein" title="" href="http://www.lifesci.sussex.ac.uk/home/Neil_Crickmore/Bt/" shape="poly" coords="67,231,217,232,218,352,68,351" />
 +
    <area alt="hmmscan" title="" href="http://www.hmmr.org" shape="poly" coords="492,738,573,817,493,895,416,821" />
 +
    <area alt="blastp" title="" href="https://www.ncbi.nlm.nih.gov/pubmed/2231712" shape="poly" coords="307,748,382,821,305,899,229,823" />
 +
    <area alt="RandomForest" title="" href="http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" shape="poly" coords="115,747,193,826,115,902,39,827" />
 +
</map>
 +
<figure><img src=https://static.igem.org/mediawiki/2016/0/08/T--Wageningen_UR--pipeline_overview_TS.png usemap="#pipeline-map"/><figcaption>Figure 2: A graphical representation of the pipeline showing the various methods and tools used. All the pink diamonds are clickable and will take you to the respective tool's homepage. The known Cry proteins box will take you to the Cry protein database. </figcaption></figure>
  
<figure><img src="Figure not yet uploaded."><figcaption>Figure 1: A graphical representation of the pipeline showing the various methods and tools used.</figcaption></figure>
 
</p>
 
 
<h2>Software description</h2>
 
<h2>Software description</h2>
 +
<p onclick="javascript:ShowHide('HiddenDiv1')" style="border: 2px solid gray;">Click here for a highly detailed overview of how the pipeline works.</p>
 +
<div class="mid" id="HiddenDiv1" style="display: none; border: 2px solid gray;">
 
<h3>Genome assembly</h3>
 
<h3>Genome assembly</h3>
 
<p>Raw genome data directly from a Next Generation Sequencing device comes in the form of files containing many pieces of DNA called ‘reads’. These reads are usually obtained through cutting up the genome and sequencing the small bits at random after amplifying them many times. The idea is that many reads will overlap and can be assembled like a big puzzle. Our pipeline uses a <i>de novo</i> assembler, meaning that it can assemble the reads from scratch, without a reference genome to map them to. This is done by a  
 
<p>Raw genome data directly from a Next Generation Sequencing device comes in the form of files containing many pieces of DNA called ‘reads’. These reads are usually obtained through cutting up the genome and sequencing the small bits at random after amplifying them many times. The idea is that many reads will overlap and can be assembled like a big puzzle. Our pipeline uses a <i>de novo</i> assembler, meaning that it can assemble the reads from scratch, without a reference genome to map them to. This is done by a  
 
<a class="tooltip"> 'graph-based approach' <span class="tooltiptext" style="width:500px;">
 
<a class="tooltip"> 'graph-based approach' <span class="tooltiptext" style="width:500px;">
 
The nodes of the graph represent whole reads or parts of each read called k-mers, while the edges represent the overlap between them.</span></a>.  
 
The nodes of the graph represent whole reads or parts of each read called k-mers, while the edges represent the overlap between them.</span></a>.  
<sup><a href="#ts3" id="ref_ts3">3</a></sup>   
+
<sup><a href="#ts7" id="ref_ts7">7</a></sup>   
Once the genome is assembled into <a class="tooltip">contigs <span class="tooltiptext" style="width:500px;">
+
Once the genome is assembled into <a class="tooltip">contigs<span class="tooltiptext" style="width:500px;">
 
A contig is a closed loop in the graph, the longest stretches of DNA possible from all of the reads.</span></a>
 
A contig is a closed loop in the graph, the longest stretches of DNA possible from all of the reads.</span></a>
 
  we can start doing gene prediction. In this pipeline we used the  
 
  we can start doing gene prediction. In this pipeline we used the  
<a class="tooltip"> IDBA-UD <span class="tooltiptext" style="width:500px;">
+
<a class="tooltip" href=http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/> IDBA-UD <span class="tooltiptext" style="width:500px;">
The Iterative De Bruijn Graph De Novo Assembler for highly Uneven sequencing Depth (<a href=http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/>IDBA-UD</a>) is a tool for assembling genomes in a <i>de novo</i> manner. </span></a>.
+
The Iterative De Bruijn Graph De Novo Assembler for highly Uneven sequencing Depth (IDBA-UD) is a tool for assembling genomes in a <i>de novo</i> manner </span></a>.
<sup><a href="#ts15" id="ref_ts15">15</a></sup>   
+
<sup><a href="#ts8" id="ref_ts8">8</a></sup>   
 
</p>
 
</p>
 
<h3>Gene prediction</h3>
 
<h3>Gene prediction</h3>
<p>The pipeline does gene prediction through the use of Hidden Markov Models (HMM) <sup><a href="#ts4" id="ref_ts4">4</a></sup>   
+
<p>The pipeline does gene prediction through the use of Hidden Markov Models (HMM) <sup><a href="#ts9" id="ref_ts9">9</a></sup>   
that have been trained to distinguish between prokaryotic coding and non-coding regions. These regions are then translated using BioPython's <sup><a href="#ts6" id="ref_ts6">6</a></sup> translate function. These are the proteins we compare to our existing  <a href=http://www.lifesci.sussex.ac.uk/home/Neil_Crickmore/Bt/toxins2.html> Cry Protein database</a> <sup><a href="#ts5" id="ref_ts5">5</a></sup>   
+
that have been trained to distinguish between prokaryotic coding and non-coding regions. These regions are then translated using BioPython's <sup><a href="#ts10" id="ref_ts10">10</a></sup> translate function. These are the proteins we compare to our existing  <a href=http://www.lifesci.sussex.ac.uk/home/Neil_Crickmore/Bt/toxins2.html> Cry Protein database</a> <sup><a href="#ts11" id="ref_ts11">11</a></sup>   
 
In this pipeline we used the  
 
In this pipeline we used the  
<a class="tooltip"> GeneMarkS <span class="tooltiptext" style="width:500px;">
+
<a href=http://exon.gatech.edu/GeneMark/ class="tooltip"> GeneMarkS <span class="tooltiptext" style="width:500px;">
<a href=http://exon.gatech.edu/GeneMark/> GeneMarkS</a> utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes.
+
utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. </span></a>  
</span></a>  
+
program for this purpose. <sup><a href="#ts12" id="ref_ts12">12</a></sup>   
program for this purpose. <sup><a href="#ts15" id="ref_ts15">15</a></sup>   
+
 
  </p>
 
  </p>
 
<h3>Machine Learning </h3>
 
<h3>Machine Learning </h3>
<p>As cry proteins are a group that is hard to define biologically, other than having specific insecticidal activity, it is somewhat of a challenge to find them through means other than direct sequence comparisons. This is why we use three methods to find potential cry proteins: <ol>  
+
<p>As Cry proteins are a group that is hard to define biologically, other than having highly specific toxic activity, it is somewhat of a challenge to find them through means other than direct sequence comparisons. This is why we use three methods to find potential Cry proteins: <ol>  
<li>A direct sequence comparison with a stringent cut-off (e-value 0.1 or lower):  
+
<li>A direct sequence comparison with a stringent cut-off defined to be an <a class="tooltip">E value<span class="tooltiptext" style="width:500px;">Expect value is the number of hits one can "expect" to see by chance when searching a database of a particular size.</span></a>of 0.1 or lower and an alignment overlap of at least 50%:  
<a class="tooltip"> BLAST <span class="tooltiptext" style="width:500px;">
+
<a class="tooltip">BLAST<span class="tooltiptext" style="width:500px;">
The Basic Local Alignment Search Tool (<b>BLAST</b>) is an algorithm for comparing sequence information, such as the amino-acid (protein) or nucleic-acid (DNA/RNA) sequences.  
+
The Basic Local Alignment Search Tool (<b>BLAST</b>) is an algorithm for comparing sequence information, such as amino-acid (protein) or nucleic-acid (DNA/RNA) sequences.</span></a> <sup><a href="#ts13" id="ref_ts13">13</a></sup>  
</span></a>  
+
<br/>
<sup><a href="#ts8" id="ref_ts8">8</a></sup> Using a database of only cry proteins, this tool will allow us to rapidly identify other proteins with highly similar sequences.
+
Using a database of only Cry proteins, this tool will allow us to rapidly identify other proteins with highly similar sequences.
 
</li>
 
</li>
<li>A probabilistic approach based on the primary subdivision between cry proteins (45% sequence similarity): HMMER.
+
<li>A probabilistic approach based on the primary subdivision between Cry proteins (45% sequence similarity): HMMER.
<br>
+
<br/>
The HMMER toolkit contains several programs that the pipeline uses. <sup><a href="#ts9" id="ref_ts9">9</a></sup> The program first builds many different profiles based on the multiple sequence alignment of the primary cry protein groups. This is handled by Clustal Omega. <sup><a href="#ts10" id="ref_ts10">10</a></sup>  
+
The HMMER toolkit contains several programs that the pipeline uses. <sup><a href="#ts14" id="ref_ts14">14</a></sup> The program first builds many different profiles based on the multiple sequence alignment of the primary Cry protein groups. This is handled by Clustal Omega. <sup><a href="#ts15" id="ref_ts15">15</a></sup>  
 
<ul>
 
<ul>
 
<li>HMMbuild: Builds a profile HMM from an input multiple alignment.</li>
 
<li>HMMbuild: Builds a profile HMM from an input multiple alignment.</li>
Line 84: Line 115:
 
<li>HMMscan: Searches a protein sequence against a protein profile HMM database. </li>
 
<li>HMMscan: Searches a protein sequence against a protein profile HMM database. </li>
 
</ul>
 
</ul>
All of these methods are still based on the same Hidden Markov Model principle <sup><a href="#ts4" id="ref_ts4">4</a></sup>.  </li>
+
All of these methods are still based on the same Hidden Markov Model principle <sup><a href="#ts9" id="ref_ts9">9</a></sup>.  </li>
  
<li>A machine learning approach based on protein features using many weak classifiers to give a majority vote: "RandomForest".<sup><a href="#ts7" id="ref_ts7">7</a></sup> This part is explained further in the protein classification section.
+
<li>A machine learning approach based on protein features using many weak classifiers to give a majority vote: "RandomForest".<sup><a href="#ts16" id="ref_ts16">16</a></sup> This part is explained further in the protein classification section.
 
</li>
 
</li>
 
</ol>
 
</ol>
Line 92: Line 123:
 
</p>
 
</p>
 
</p><h3>Protein Classification</h3><p>
 
</p><h3>Protein Classification</h3><p>
The protein classification part of the machine learning is done by the pipeline in two steps: First the features of a protein are predicted using Biopython <sup><a href="#ts6" id="ref_ts6">6</a></sup> , and then the pre-trained model predicts a class based on those features.
+
The protein classification part of the machine learning is done by the pipeline in two steps: First the features of a protein are predicted using Biopython <sup><a href="#ts10" id="ref_ts10">10</a></sup> , and then the pre-trained model predicts a class based on those features.
 
</p><h4>Features</h4><p>
 
</p><h4>Features</h4><p>
 
After some experimentation with the features chosen for modelling, we came up with the following list which we think will capture essential structural elements and allow for good predictions. <ul>
 
After some experimentation with the features chosen for modelling, we came up with the following list which we think will capture essential structural elements and allow for good predictions. <ul>
 
<li>Molecular weight of the protein.</li>
 
<li>Molecular weight of the protein.</li>
 
<li>Aromaticity, ie the relative frequency of aromatic amino acids (Phenylalanine, Tryptophan and Tyrosine)</li>
 
<li>Aromaticity, ie the relative frequency of aromatic amino acids (Phenylalanine, Tryptophan and Tyrosine)</li>
<li>Instability index, ie the relative frequency of instability causing dipeptides, for more information see <a href=http://peds.oxfordjournals.org/content/4/2/155>Guruprasad et al 1990.</a><sup><a href="#ts11" id="ref_ts11">11</a></sup> </li>
+
<li>Instability index, ie the relative frequency of instability causing dipeptides, for more information see <a href=http://peds.oxfordjournals.org/content/4/2/155>Guruprasad et al 1990.</a><sup><a href="#ts17" id="ref_ts17">17</a></sup> </li>
 
<li>Isoelectric point, which is the pH at which a particular molecule carries no net electrical charge.</li>
 
<li>Isoelectric point, which is the pH at which a particular molecule carries no net electrical charge.</li>
 
<li>Alpha Helix fraction, the relative frequency of amino acids that tend to be in found in the secondary structure known as alpha helices. </li>
 
<li>Alpha Helix fraction, the relative frequency of amino acids that tend to be in found in the secondary structure known as alpha helices. </li>
Line 104: Line 135:
 
</ul>
 
</ul>
 
</p><h4>Scikit-Learn</h4><p>
 
</p><h4>Scikit-Learn</h4><p>
We made the classification model using scikit-learn or sklearn.<sup><a href="#ts12" id="ref_ts12">12</a></sup> The class used for this model is called the <a href=http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>RandomForestClassifier</a> which works by fitting a number of decision trees, in our case 2000, on various sub-samples of the dataset and averaging these weak classifiers to improve accuracy and control for.  
+
We made the classification model using scikit-learn or sklearn.<sup><a href="#ts18" id="ref_ts18">18</a></sup> The class used for this model is called the <a href=http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>RandomForestClassifier</a> which works by fitting a number of decision trees, in our case 2000, on various sub-samples of the dataset and averaging these weak classifiers to improve accuracy and control for.  
 
<a class="tooltip">overfitting<span class="tooltiptext" style="width:500px;">
 
<a class="tooltip">overfitting<span class="tooltiptext" style="width:500px;">
 
Overfitting is what happens when a model follows its training data so well it can not predict anything new with any accuracy because too much of the noise from the training data has been modelled, rather than the underlying relationship.
 
Overfitting is what happens when a model follows its training data so well it can not predict anything new with any accuracy because too much of the noise from the training data has been modelled, rather than the underlying relationship.
Line 116: Line 147:
 
<h3>Venn-Diagrams</h3>
 
<h3>Venn-Diagrams</h3>
 
<p>All three separate methods give a different output. In order to make an easy visual comparison, we make use of a web tool that can easily calculate and draw custom Venn-diagrams. This tool was made by the university of Gent and can be found <a href=http://bioinformatics.psb.ugent.be/webtools/Venn/>here</a>.  
 
<p>All three separate methods give a different output. In order to make an easy visual comparison, we make use of a web tool that can easily calculate and draw custom Venn-diagrams. This tool was made by the university of Gent and can be found <a href=http://bioinformatics.psb.ugent.be/webtools/Venn/>here</a>.  
 +
</div>
 +
<p onclick="javascript:ShowHide('HiddenDiv1')" style="border: 2px solid gray;">Click here to open or close the overview.</p>
 
</section>
 
</section>
 
 
<section id="TS_results">
 
<section id="TS_results">
<h1>Pipeline validation</h1>
+
<h1><i>Varroa</i> Isolate results</h1>
<h2>PRJEB5931</h2><p>
+
<p> The <a href="https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#Isolates4"><i>Varroa</i> isolates experiment</a> found an isolate of interest which warrented more analysis. This potential <i>Varroa</i>-killing bacteria was sequenced using Next Generation Sequencing (NGS). According to the 16S analysis results as seen in <a href="https://2016.igem.org/Team:Wageningen_UR/Notebook/VarroaIsolates">Lisa's notebook</a> this isolate appeared to be a close match to <i>Lysinibacillus</i>.  
We tested the pipeline on a genome sample from a study with accession number <a href=http://www.ebi.ac.uk/ena/data/view/PRJEB5931> PRJEB5931</a>. <sup><a href="#ts13" id="ref_ts13">13</a></sup> This genome was found after a co-evolution experiment, and a <i>Bacillus thuringiensis</i> with known nematicidal cry proteins present.  
+
From over 17.5 million reads, the pipeline assembled the sequencing results into a genome consisting of 822 contigs. From this assembled genome the pipeline was able to find 4,427 genes, of which 4 were identified as potential Cry proteins. These were further investigated. For the results, click on the button below. More statistics about the assembly can be found in Table 1.</p>
<br>
+
<a href=http://www.ebi.ac.uk/ena/data/view/ERX463573>ERX463573</a> is the accession code of the experiment from which these raw read files came. The description reads: "Population of Bacillus thuringiensis coming from 5 strains experimentally coevolved with Caenorhabditis as host" and the Name: "Coevolution G12 Pop4". From the study we know to expect at the very least the following two proteins:
+
<a href=https://www.ncbi.nlm.nih.gov/nucleotide/47500285> Cry35Aa4 </a>
+
and
+
<a href=https://www.ncbi.nlm.nih.gov/nucleotide/2724454>Cry21Aa2</a>. <br>
+
According to the cry protein toxin list it is known that Cry35Aa4 is a binary toxin with <a href=https://www.ncbi.nlm.nih.gov/nucleotide/47500295> Cry34Aa4 </a> and as such we may expect to find this protein, or one like it, as well.
+
  
 +
<figure><figcaption>Table 1. Genome Assembly statistics from the <a href="https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#Isolates4"><i>Varroa</i> isolates experiment</a> </figcaption></figure>
 +
<table>
 +
<tbody>
 +
<tr>
 +
<td >Total reads</td>
 +
<td>17,579,690</td>
 +
</tr>
 +
<tr>
 +
<td>Number of aligned reads</td>
 +
<td>17,078,201</td>
 +
</tr>
 +
<tr>
 +
<td>Expected genome coverage</td>
 +
<td>4.35088</td>
 +
</tr>
 +
<tr>
 +
<td>De Bruijn Graph edges</td>
 +
<td>112</td>
 +
</tr>
 +
<tr>
 +
<td>Total contigs</td>
 +
<td>822</td>
 +
</tr>
 +
<tr>
 +
<td>n50 value</td>
 +
<td>163,362</td>
 +
</tr>
 +
<tr>
 +
<td>Largest contig length</td>
 +
<td>504,583</td>
 +
</tr>
 +
<tr>
 +
<td>Mean contig length</td>
 +
<td>4,905</td>
 +
</tr>
 +
<tr>
 +
<td>Total genome length</td>
 +
<td>4,032,210</td>
 +
</tr>
 +
</tbody>
 +
</table>
  
</p><h2>Key result</h2><p>
 
The easiest way to visualize the results from the three separate components of the pipeline is to use a <a href=http://bioinformatics.psb.ugent.be/webtools/Venn/>Venn diagram</a> (with accompanying lists).
 
  
 +
<figure><a href="https://static.igem.org/mediawiki/2016/9/97/T--Wageningen_UR--pipeline_NGS_result_inspection.pdf"><img src="https://static.igem.org/mediawiki/2016/3/3a/T--Wageningen_UR--toplogobutton.jpg"></a><figcaption>Click the button to go to the screenshot images! </figcaption></figure>
  
<figure><img src=https://static.igem.org/mediawiki/2016/2/2c/T--Wageningen_UR--Venn-Diagram_TS.png><figcaption>Figure 2: A Venn diagram showing the overlap of the output of the various methods used to predict whether or not certain genes from the genome were cry proteins or not.</figcaption></figure>
+
<p>The button links to a pdf which shows 4 screenshots of further examination of "gene_678", which was found to be a potential Cry protein.  
 
+
<br/>
As Blast had only 5 results it is easier to examine this method in detail. Two of the five were proteins we were expecting from the paper: Cry35Aa4 and Cry21Aa2. These were found to be: <ul>
+
<b>Image 1)</b> shows the BLAST result from this gene against the <a class="tooltip"><i>nr</i> database protein<span class="tooltiptext" style="width:500px;">The Non-Redundant (<i>nr</i>) contains non redundant sequences from GenBank translations (i.e. GenPept) together with sequences from other databanks (Refseq, PDB, SwissProt, PIR and PRF).</span></a>. The conserved domains shown are the OrfB_IS605 superfamily and the Cysteine rich_CPCC domain. Neither of these are known to have a functional characterization. The main hits belong to a class of genes known as transposases, which are known to move, and bind to, <a class="tooltip">Transposons<span class="tooltiptext" style="width:500px;"> A transposon is a DNA sequence that can change its position within a genome.</span></a> However, this domain corresponds only to the first 77% of the gene sequence. The remaining 23% requires further investigation.
<li>Gene_5932, which had 100.00% identity with: Cry35Aa4 </li>
+
<br/>
<li>Gene_5527, which had 96.04% identity with: Cry21Aa2</li>
+
<b>Image 2)</b> shows the BLAST result from the remaining 23% of the gene sequence against the <i>nr</i> protein database (in this image named “Protein Sequence (71 letters)”). The sequence shows similarity to known membrane proteins. Cry proteins are known to interact with the membrane to form the pores needed to kill their target.
<ul>
+
<br/>
Based on the fact that Cry35Aa4 is a binary protein we expected to find its complementary protein as well, which we did: <ul>  
+
<b>Image 3) </b> shows the output of the "Coils" expasy tool <sup><a href="#ts20" id="ref_ts20">4</a></sup>, used to examine the secondary structure of the Cry2Ab8 protein. This image shows that there are probably some coils around the 100 amino acid area.
<li>Gene_5931, which had 100.00% identity with all four proteins in the Cry34Aa group.</li>
+
<br/>
</ul>
+
<b>Image 4)</b> shows the output of the "Coils" expasy tool <sup><a href="#ts20" id="ref_ts20">4</a></sup>, used to examine the secondary structure of the protein resulting from "gene 678". This image shows that there may be some coils at the start of the protein.
Two other proteins appeared with a good overlap and a good e-value, namely: <ul>
+
</p>
<li>Gene_5934, which had 98.39% identity with: Cry38Aa1, which has no known insecticidal target, but is highly similar to proteins that do: Cry15Aa1, Cry23Aa1, and Cry33Aa1. <sup><a href="#ts17" id="ref_ts17">17</a></sup></li>
+
<p>The other 3 proteins show similar patterns and as such were all picked up by the pipeline. Given this we suggest these proteins would be prime candidates for further experimental validation.</p>
<li>Gene_5518, which had 85.87% identity with: Cry14Ab1. Which is only mentioned in a patent by Sampson et al. 2012. <sup><a href="#ts18" id="ref_ts18">18</a></sup> </li>  
+
</ul>
+
Both of these were also picked up by the Hidden Markov Model method of cry protein detection as shown in figure 2, but not by the RandomForest method.  
+
 
+
 
</section>
 
</section>
 
+
<section id="references">
<section id="TS_discussion"><h1>Discussion</h1>
+
<h1>References</h1>
<p>All three methods were fully able to pick up the nematicidal proteins from the study we had expected to find, and the complementary protein we predicted to be present ourselves.
+
<ol class="references">
<h4> Blast</h4>
+
<a id="ts1" href=http://www.ncbi.nlm.nih.gov/pubmed/11275324>1.</a>de Maagd, R. A., Bravo, A., & Crickmore, N. (2001). How <i>Bacillus thuringiensis</i> has evolved specific toxins to colonize the insect world. TRENDS in Genetics, 17(4), 193-199.
As we had intended the Blast method proved to be the most stringent, detecting at least known cry proteins with high accuracy. However, in order to find new cry proteins that perform a similar function to a known cry protein but are not exactly like one, we can still turn to the other methods.
+
<a href="#ref_ts1" title="Jump back to footnote 1 in the text.">↩</a>
<h4> HMM </h4>
+
<br/><br/>
 
+
<h4> RandomForest </h4>
+
The ML could be further optimized based on a more rigorous examination of what is and what is not a cry protein.
+
For 5934 the RF prediction didn’t happen because it did not have the right amount of aromatic residues, which caused a high increase in the probability of it not being a cry protein. (Instance 23 of the last list of RF_analysis.txt)
+
 
+
The idea behind the training set used for this model was that the model should be able to distinguish what is defined as cry and what is not cry, but still similar in sequence. To that end the uniprot database was queried and 200 proteins that came back using this query in the uniprot web interface:
+
<code> [my list] NOT go:"sporulation resulting in formation of a cellular spore [0030435]" NOT family:"delta endotoxin" NOT cry NOT pesticidal active:yes NOT annotation:(type:"positional domain" "Endotoxin M") NOT name:crystal</code>
+
This list of 200 proteins was investigated (somewhat)
+
</p>
+
</section><h1>References</h1>
+
<a id="ts1" href=http://aem.asm.org/content/78/14/4795.short>1.</a> Ye, W., Zhu, L., Liu, Y., Crickmore, N., Peng, D., Ruan, L., & Sun, M. (2012). Mining new crystal protein genes from <i>Bacillus thuringiensis</i> on the basis of mixed plasmid-enriched genome sequencing and a computational pipeline. Applied and environmental microbiology, 78(14), 4795-4801.
+
<a href="#ts1" title="Jump back to footnote 1 in the text.">↩</a>
+
<br><br>
+
 
<a id="ts2" href=http://link.springer.com/article/10.1007/s13592-014-0288-z>2.</a> Alquisira-Ramírez, E. V., Paredes-Gonzalez, J. R., Hernández-Velázquez, V. M., Ramírez-Trujillo, J. A., & Peña-Chora, G. (2014). In vitro susceptibility of Varroa destructor and Apis mellifera to native strains of <i>Bacillus thuringiensis</i>. Apidologie, 45(6), 707-718.
 
<a id="ts2" href=http://link.springer.com/article/10.1007/s13592-014-0288-z>2.</a> Alquisira-Ramírez, E. V., Paredes-Gonzalez, J. R., Hernández-Velázquez, V. M., Ramírez-Trujillo, J. A., & Peña-Chora, G. (2014). In vitro susceptibility of Varroa destructor and Apis mellifera to native strains of <i>Bacillus thuringiensis</i>. Apidologie, 45(6), 707-718.
<a href="#ts2" title="Jump back to footnote 2 in the text.">↩</a>
+
<a href="#ref_ts2" title="Jump back to footnote 2 in the text.">↩</a>
<br><br>
+
<br/><br/>
<a id="ts3" href= http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html>3.</a> Compeau, P. E., Pevzner, P. A., & Tesler, G. (2011). How to apply de Bruijn graphs to genome assembly. Nature biotechnology, 29(11), 987-991.
+
<a id="ts3" href=http://aem.asm.org/content/78/14/4795.short>3.</a> Ye, W., Zhu, L., Liu, Y., Crickmore, N., Peng, D., Ruan, L., & Sun, M. (2012). Mining new crystal protein genes from  <i>Bacillus thuringiensis</i> on the basis of mixed plasmid-enriched genome sequencing and a computational pipeline. Applied and environmental microbiology, 78(14), 4795-4801.
<a href="#ts3" title="Jump back to footnote 3 in the text.">↩</a>
+
<a href="#ref_ts3" title="Jump back to footnote 3 in the text.">↩</a>
<br><br>
+
<br/><br/>
<a id="ts4" href=http://ieeexplore.ieee.org/document/1165342/>4.</a> Rabiner, L., & Juang, B. (1986). An introduction to hidden Markov models. ieee assp magazine, 3(1), 4-16.
+
<a id="ts4" href= http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002169 >4.</a>  Masri, L., Branca, A., Sheppard, A. E., Papkou, A., Laehnemann, D., Guenther, P. S., ... & Brzuszkiewicz, E. (2015). Host–pathogen coevolution: the selective advantage of Bacillus thuringiensis virulence and its cry toxin genes. PLoS Biol, 13(6), e1002169.
<a href="#ts4" title="Jump back to footnote 4 in the text.">↩</a>
+
<a href="#ref_ts4" title="Jump back to footnote 4 in the text.">↩</a>
<br><br>
+
<br/><br/>
<a id="ts5" href=http://www.btnomenclature.info/  >5.</a>  Crickmore, N., Baum, J., Bravo, A., Lereclus, D., Narva, K., Sampson, K., Schnepf, E., Sun, M. and Zeigler, D.R. " <i>Bacillus thuringiensis</i> toxin nomenclature" (2016)
+
<a id="ts5" href=http://aem.asm.org/content/70/8/4889.long>5.</a>  Baum, J. A., Chu, C. R., Rupar, M., Brown, G. R., Donovan, W. P., Huesing, J. E., ... & Vaughn, T. (2004). Binary toxins from Bacillus thuringiensis active against the western corn rootworm, Diabrotica virgifera virgifera LeConte. Applied and environmental microbiology, 70(8), 4889-4898.
 +
<a href="#ref_ts5" title="Jump back to footnote 5 in the text.">↩</a>
 +
<br/><br/>
 +
<a id="ts6" href= https://www.google.com/patents/US8318900 >6.</a> Sampson, K. S., Tomso, D. J., & Dumitru, R. V. (2012). U.S. Patent No. 8,318,900. Washington, DC: U.S. Patent and Trademark Office. 
 +
<a href="#ref_ts6" title="Jump back to footnote 6 in the text.">↩</a>
 +
<br/><br/>
 +
<a id="ts7" href= http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html>7.</a> Compeau, P. E., Pevzner, P. A., & Tesler, G. (2011). How to apply de Bruijn graphs to genome assembly. Nature biotechnology, 29(11), 987-991.
 +
<a href="#ref_ts7" title="Jump back to footnote 7 in the text.">↩</a>
 +
<br/><br/>
 +
<a id="ts8" href=http://bioinformatics.oxfordjournals.org/content/28/11/1420.short  >8.</a>  Peng, Y., Leung, H. C., Yiu, S. M., & Chin, F. Y. (2012). IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28(11), 1420-1428.
 +
<a href="#ref_ts8" title="Jump back to footnote 8 in the text.">↩</a>
 +
<br/><br/>
 +
<a id="ref_ts9" href=http://ieeexplore.ieee.org/document/1165342/>9.</a> Rabiner, L., & Juang, B. (1986). An introduction to hidden Markov models. ieee assp magazine, 3(1), 4-16.
 +
<a href="#ref_ts9" title="Jump back to footnote 9 in the text.">↩</a>
 +
<br/><br/>
 +
<a id="ts10" href=http://dx.doi.org/10.1093/bioinformatics/btp163 >10.</a>  Cock PA, Antao T, Chang JT, Bradman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423
 +
<a href="#ref_ts10" title="Jump back to footnote 10 in the text.">↩</a>
 +
<br/><br/>
 +
<a id="ts11" href=http://www.btnomenclature.info/  >11.</a>  Crickmore, N., Baum, J., Bravo, A., Lereclus, D., Narva, K., Sampson, K., Schnepf, E., Sun, M. and Zeigler, D.R. " <i>Bacillus thuringiensis</i> toxin nomenclature" (2016)
 
http://www.btnomenclature.info/  
 
http://www.btnomenclature.info/  
<a href="#ts5" title="Jump back to footnote 5 in the text.">↩</a>
+
<a href="#ref_ts11" title="Jump back to footnote 11 in the text.">↩</a>
<br><br>
+
<br/><br/>
<a id="ts6" href=http://dx.doi.org/10.1093/bioinformatics/btp163 >6.</a> Cock PA, Antao T, Chang JT, Bradman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423
+
<a id="ts12" href=http://www.ncbi.nlm.nih.gov/pubmed/11410670>12.</a>   Besemer, J., Lomsadze, A., & Borodovsky, M. (2001). GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic acids research, 29(12), 2607-2618.
<a href="#ts6" title="Jump back to footnote 6 in the text.">↩</a>
+
<a href="#ref_ts12" title="Jump back to footnote 12 in the text.">↩</a>
<br><br>
+
<br/><br/>
<a id="ts7" href=ftp://131.252.97.79/Transfer/Treg/WFRE_Articles/Liaw_02_Classification%20and%20regression%20by%20randomForest.pdf >7.</a>  Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.
+
<a id="ts13" href=https://www.ncbi.nlm.nih.gov/pubmed/2231712 >13.</a>  Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.
<a href="#ts7" title="Jump back to footnote 7 in the text.">↩</a>
+
<a href="#ref_ts13" title="Jump back to footnote 13 in the text.">↩</a>
<br><br>
+
<br/><br/>
<a id="ts8" href= >@.</a>  Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.
+
<a id="ts14" href=www.hmmr.org >14.</a>  Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755-763.
<a href="#ts8" title="Jump back to footnote 8 in the text.">↩</a>
+
<a href="#ref_ts14" title="Jump back to footnote 14 in the text.">↩</a>
<br><br>
+
<br/><br/>
<a id="ts9" href=www.hmmr.org >9.</a>  Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755-763.
+
<a id="ts15" href=http://link.springer.com/protocol/10.1007/978-1-62703-646-7_6  >15.
<a href="#ts9" title="Jump back to footnote 9 in the text.">↩</a>
+
</a>  Sievers, F., & Higgins, D. G. (2014). Clustal Omega, accurate alignment of very large numbers of sequences. Multiple sequence alignment methods, 105-116.
<br><br>
+
<a href="#ref_ts15" title="Jump back to footnote 15 in the text.">↩</a>
<a id="ts10" href=http://link.springer.com/protocol/10.1007/978-1-62703-646-7_6  >10.</a>  Sievers, F., & Higgins, D. G. (2014). Clustal Omega, accurate alignment of very large numbers of sequences. Multiple sequence alignment methods, 105-116.
+
<br/><br/>
<a href="#ts10 " title="Jump back to footnote 10 in the text.">↩</a>
+
<a id="ts16" href=ftp://131.252.97.79/Transfer/Treg/WFRE_Articles/Liaw_02_Classification%20and%20regression%20by%20randomForest.pdf >16.</a> Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.
<br><br>
+
<a href="#ref_ts16" title="Jump back to footnote 16 in the text.">↩</a>
<a id="ts11" href=http://peds.oxfordjournals.org/content/4/2/155  >11.</a>  Guruprasad, K., Reddy, B. B., & Pandit, M. W. (1990). Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein engineering, 4(2), 155-161.
+
<br/><br/>
<a href="#ts11 " title="Jump back to footnote 11  in the text.">↩</a>
+
<a id="ts17" href=http://peds.oxfordjournals.org/content/4/2/155  >17.</a>  Guruprasad, K., Reddy, B. B., & Pandit, M. W. (1990). Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein engineering, 4(2), 155-161.
<br><br>
+
<a href="#ref_ts17" title="Jump back to footnote 17 in the text.">↩</a>
<a id="ts12" href=http://scikit-learn.org/stable/  >12.</a>  Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.
+
<br/><br/>
<a href="#ts12 " title="Jump back to footnote 12 in the text.">↩</a>
+
<a id="ts18" href=http://scikit-learn.org/stable/  >18.</a>  Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.
<br><br>
+
<a href="#ref_ts18" title="Jump back to footnote 18 in the text.">↩</a>
<a id="ts13" href= http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002169 >13.</a>   Masri, L., Branca, A., Sheppard, A. E., Papkou, A., Laehnemann, D., Guenther, P. S., ... & Brzuszkiewicz, E. (2015). Host–pathogen coevolution: the selective advantage of Bacillus thuringiensis virulence and its cry toxin genes. PLoS Biol, 13(6), e1002169.
+
<br/><br/>
<a href="#ts13 " title="Jump back to footnote 13  in the text.">↩</a>
+
<a id="ts19" href="http://www.sciencedirect.com/science/article/pii/S0031320396001422">19.</a> Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7), 1145-1159.  
<br><br>
+
<a href="#ref_ts19" title="Jump back to footnote 19 in the text.">↩</a>
<a id="ts14" href=http://www.ncbi.nlm.nih.gov/pubmed/11275324>14.</a>de Maagd, R. A., Bravo, A., & Crickmore, N. (2001). How  <i>Bacillus thuringiensis</i> has evolved specific toxins to colonize the insect world. TRENDS in Genetics, 17(4), 193-199.
+
<br/><br/>
<a href="#ts14 " title="Jump back to footnote 14  in the text.">↩</a><br><br>
+
<a id="ts20" href="https://www.researchgate.net/profile/Andrei_Lupas/publication/6042403_Predicting_coiled_coils_from_protein_sequences/links/00b4951710268a2a2d000000.pdf" >20.</a> Lupas, A., Van Dyke, M., and Stock, J. (1991) Predicting Coiled Coils from Protein Sequences,Science 252:1162-1164.
<a id="ts15" href=http://bioinformatics.oxfordjournals.org/content/28/11/1420.short  >15.</a>  Peng, Y., Leung, H. C., Yiu, S. M., & Chin, F. Y. (2012). IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28(11), 1420-1428.
+
<a href="#ref_ts20" title="Jump back to footnote 20 in the text.">↩</a>
<a href="#ts15 " title="Jump back to footnote 15 in the text.">↩</a>
+
</ol>
<br><br>
+
</section>
<a id="ts16" href=http://www.ncbi.nlm.nih.gov/pubmed/11410670>16.</a>  Besemer, J., Lomsadze, A., & Borodovsky, M. (2001). GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic acids research, 29(12), 2607-2618.
+
<a href="#ts16 " title="Jump back to footnote 16  in the text.">↩</a>
+
<br><br>
+
<a id="ts17" href=http://aem.asm.org/content/70/8/4889.long  >17.</a>   Baum, J. A., Chu, C. R., Rupar, M., Brown, G. R., Donovan, W. P., Huesing, J. E., ... & Vaughn, T. (2004). Binary toxins from Bacillus thuringiensis active against the western corn rootworm, Diabrotica virgifera virgifera LeConte. Applied and environmental microbiology, 70(8), 4889-4898.
+
<a href="#ts17 " title="Jump back to footnote 17  in the text.">↩</a>
+
<br><br>
+
<a id="ts18" href= https://www.google.com/patents/US8318900 >18.</a> Sampson, K. S., Tomso, D. J., & Dumitru, R. V. (2012). U.S. Patent No. 8,318,900. Washington, DC: U.S. Patent and Trademark Office. 
+
<a href="#ts18" title="Jump back to footnote 18  in the text.">↩</a>
+
<br><br>
+
<a id="ts19" href=  >19.</a>  Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7), 1145-1159.
+
<a href="#ts19 " title="Jump back to footnote 19  in the text.">↩</a>
+
<br><br>
+
 
+
 
</html>
 
</html>
 
{{Wageningen_UR/footer}}
 
{{Wageningen_UR/footer}}

Latest revision as of 03:17, 20 October 2016

Wageningen UR iGEM 2016

 

Toxin Scanner

BioBrick discovery

For iGEM 2016 we designed a high-throughput pipeline for the identification of novel proteins directly from raw genome sequencing data. Given the specificity of our tool and the importance of biobrick discovery in iGEM, we made it publicly available for everyone to modify and use. The tool can be found in this Gitlab repository .
For the purpose of the BeeT project, we use it as a cry toxin predictor, given genomes of selected bacteria.

Toxin Specificity

For this project we need a toxin that specifically targets Varroa destructor. The most well known miticidal proteins are the crystal (Cry) proteins. These are usually found on megaplasmids from Bacillus Thuringiensis and related species. 1

We know Varroa-specific miticidal activity exists in Bacillus thuringiensis and related species, as shown in: "In vitro susceptibility of Varroa destructor and Apis mellifera to native strains of Bacillus thuringiensis." by Alquisira-Ramírez et al. 2 In this paper, several isolates are described that cause a mite mortality of up to 100%. Importantly, the strains also showed no miticidal activity against bee larvae. Because of this we started several sub-projects in parallel to maximize our chances of finding a viable V. destructor-killer. The specificity part of our project focuses on creating V. destructor-gut binding Cry toxins and finding V. destructor-specific miticidal proteins.

There already exists a publication about a tool called “Bt Toxin Scanner”3. This tool does not fully support local deployment, which is needed for high-throughput analysis. Also, because of the relatively basic analysis done by the tool, we decided to develop our own tool that is fully open-source and improves upon the analysis techniques used in Bt Toxin Scanner. Our goal with this tool is to run raw sequencing files, and deliver potential Cry proteins with just the click of a button.

This tool was made in preparation for results of the latter, finding V. destructor-specific miticidal proteins, which we assume to be of the Cry protein family. These Cry proteins are a diverse group, but are known to be highly specific for individual insects, acari, nematodes and various other eukaryotic taxa. Cry proteins are not necessarily a group of proteins that all perform the same function in the same manner. The distinction between Cry and non-Cry proteins is defined by a committee: Cry Protein website Based on 45% sequence similarity there are over 70 groups. This high amount of diversity makes it hard to predict when something is or isn't a Cry protein. Despite this diversity, many of them have the same three domain structure. The N-terminal domain I is involved in membrane insertion and pore formation, while domains II and III are involved in receptor recognition and binding to them.

Testing the tool

We tested the tool on a genome sample from a study with accession number PRJEB5931. 4 This genome was found after a co-evolution experiment, and a Bacillus thuringiensis with known nematicidal Cry proteins present.
ERX463573 is the accession code of the experiment from which these raw read files came. The experiment this came from was about a Population of Bacillus thuringiensis which were coevolved with Caenorhabditis, which is a kind of nematode, as host. From the study we know to expect at the very least the following two proteins: Cry35Aa4 and Cry21Aa2.
According to the Cry protein toxin list it is known that Cry35Aa4 is a binary toxin with Cry34Aa4 , and as such we may expect to find this protein, or one like it, as well.

Visualizing the result

The easiest way to visualize the results from the three separate components of the tool is to use a Venn diagram.

Figure 1: A Venn diagram showing the overlap of the output of the various methods used to predict whether or not certain genes from the genome are Cry proteins or not.

As Blast had only 5 results it is easier to examine this method in detail. Two of the five were indeed proteins we expected from the paper: Cry35Aa4 and Cry21Aa2. Both of these were also picked up by the Hidden Markov Model method of cry protein detection as shown in figure 1, but not by the RandomForest method.
Next we found a protein that matched very well with the Cry34Aa group, which is a complimentary protein to Cry35Aa4.
The two other proteins which were found: Cry38Aa1, which has no known insecticidal target, but is highly similar to proteins that do: Cry15Aa1, Cry23Aa1, and Cry33Aa1. 5
and "Gene_5518" which was 85.87% identical to: Cry14Ab1. This protein is only mentioned in a patent by Sampson et al. 2012.6 This protein is quite interesting because it might be a completely new protein or a variation of Cry14Ab1 specific to nematodes.

Tool overview

We use a combination of existing tools to come to the prediction of novel cry proteins. The entire pipeline consists of four scripts in total, one of which is entirely dedicated to analysis of the Random Forest model and not further used in the main program. The others are there to group the Machine Learning specific functions, the functions that handle known cry proteins, and the functions that handle raw sequence data. Figure 2 gives a graphical representation of the pipeline, though some modules have been left out for the sake of readability. Here, we go through each part of the process in a step-by-step manner.

idba_ud genemark Cry protein hmmscan blastp RandomForest
Figure 2: A graphical representation of the pipeline showing the various methods and tools used. All the pink diamonds are clickable and will take you to the respective tool's homepage. The known Cry proteins box will take you to the Cry protein database.

Software description

Click here for a highly detailed overview of how the pipeline works.

Click here to open or close the overview.

Varroa Isolate results

The Varroa isolates experiment found an isolate of interest which warrented more analysis. This potential Varroa-killing bacteria was sequenced using Next Generation Sequencing (NGS). According to the 16S analysis results as seen in Lisa's notebook this isolate appeared to be a close match to Lysinibacillus. From over 17.5 million reads, the pipeline assembled the sequencing results into a genome consisting of 822 contigs. From this assembled genome the pipeline was able to find 4,427 genes, of which 4 were identified as potential Cry proteins. These were further investigated. For the results, click on the button below. More statistics about the assembly can be found in Table 1.

Table 1. Genome Assembly statistics from the Varroa isolates experiment
Total reads 17,579,690
Number of aligned reads 17,078,201
Expected genome coverage 4.35088
De Bruijn Graph edges 112
Total contigs 822
n50 value 163,362
Largest contig length 504,583
Mean contig length 4,905
Total genome length 4,032,210
Click the button to go to the screenshot images!

The button links to a pdf which shows 4 screenshots of further examination of "gene_678", which was found to be a potential Cry protein.
Image 1) shows the BLAST result from this gene against the nr database proteinThe Non-Redundant (nr) contains non redundant sequences from GenBank translations (i.e. GenPept) together with sequences from other databanks (Refseq, PDB, SwissProt, PIR and PRF).. The conserved domains shown are the OrfB_IS605 superfamily and the Cysteine rich_CPCC domain. Neither of these are known to have a functional characterization. The main hits belong to a class of genes known as transposases, which are known to move, and bind to, Transposons A transposon is a DNA sequence that can change its position within a genome. However, this domain corresponds only to the first 77% of the gene sequence. The remaining 23% requires further investigation.
Image 2) shows the BLAST result from the remaining 23% of the gene sequence against the nr protein database (in this image named “Protein Sequence (71 letters)”). The sequence shows similarity to known membrane proteins. Cry proteins are known to interact with the membrane to form the pores needed to kill their target.
Image 3) shows the output of the "Coils" expasy tool 4, used to examine the secondary structure of the Cry2Ab8 protein. This image shows that there are probably some coils around the 100 amino acid area.
Image 4) shows the output of the "Coils" expasy tool 4, used to examine the secondary structure of the protein resulting from "gene 678". This image shows that there may be some coils at the start of the protein.

The other 3 proteins show similar patterns and as such were all picked up by the pipeline. Given this we suggest these proteins would be prime candidates for further experimental validation.

References

    1.de Maagd, R. A., Bravo, A., & Crickmore, N. (2001). How Bacillus thuringiensis has evolved specific toxins to colonize the insect world. TRENDS in Genetics, 17(4), 193-199.

    2. Alquisira-Ramírez, E. V., Paredes-Gonzalez, J. R., Hernández-Velázquez, V. M., Ramírez-Trujillo, J. A., & Peña-Chora, G. (2014). In vitro susceptibility of Varroa destructor and Apis mellifera to native strains of Bacillus thuringiensis. Apidologie, 45(6), 707-718.

    3. Ye, W., Zhu, L., Liu, Y., Crickmore, N., Peng, D., Ruan, L., & Sun, M. (2012). Mining new crystal protein genes from Bacillus thuringiensis on the basis of mixed plasmid-enriched genome sequencing and a computational pipeline. Applied and environmental microbiology, 78(14), 4795-4801.

    4. Masri, L., Branca, A., Sheppard, A. E., Papkou, A., Laehnemann, D., Guenther, P. S., ... & Brzuszkiewicz, E. (2015). Host–pathogen coevolution: the selective advantage of Bacillus thuringiensis virulence and its cry toxin genes. PLoS Biol, 13(6), e1002169.

    5. Baum, J. A., Chu, C. R., Rupar, M., Brown, G. R., Donovan, W. P., Huesing, J. E., ... & Vaughn, T. (2004). Binary toxins from Bacillus thuringiensis active against the western corn rootworm, Diabrotica virgifera virgifera LeConte. Applied and environmental microbiology, 70(8), 4889-4898.

    6. Sampson, K. S., Tomso, D. J., & Dumitru, R. V. (2012). U.S. Patent No. 8,318,900. Washington, DC: U.S. Patent and Trademark Office.

    7. Compeau, P. E., Pevzner, P. A., & Tesler, G. (2011). How to apply de Bruijn graphs to genome assembly. Nature biotechnology, 29(11), 987-991.

    8. Peng, Y., Leung, H. C., Yiu, S. M., & Chin, F. Y. (2012). IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28(11), 1420-1428.

    9. Rabiner, L., & Juang, B. (1986). An introduction to hidden Markov models. ieee assp magazine, 3(1), 4-16.

    10. Cock PA, Antao T, Chang JT, Bradman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423

    11. Crickmore, N., Baum, J., Bravo, A., Lereclus, D., Narva, K., Sampson, K., Schnepf, E., Sun, M. and Zeigler, D.R. " Bacillus thuringiensis toxin nomenclature" (2016) http://www.btnomenclature.info/

    12. Besemer, J., Lomsadze, A., & Borodovsky, M. (2001). GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic acids research, 29(12), 2607-2618.

    13. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.

    14. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755-763.

    15. Sievers, F., & Higgins, D. G. (2014). Clustal Omega, accurate alignment of very large numbers of sequences. Multiple sequence alignment methods, 105-116.

    16. Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.

    17. Guruprasad, K., Reddy, B. B., & Pandit, M. W. (1990). Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein engineering, 4(2), 155-161.

    18. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.

    19. Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7), 1145-1159.

    20. Lupas, A., Van Dyke, M., and Stock, J. (1991) Predicting Coiled Coils from Protein Sequences,Science 252:1162-1164.