Difference between revisions of "Team:Wageningen UR/Software"

Line 4: Line 4:
 
<div class="menu-head">
 
<div class="menu-head">
 
<h4><a href="#header">Toxin Scanner</a></h4>
 
<h4><a href="#header">Toxin Scanner</a></h4>
 +
<li class="menu-item">
 +
<a href="#TS_intro">Introduction</a>
 +
<li class="menu-item">
 +
<a href="#TS_methods">Methods</a>
 +
</li> 
 +
<li class="menu-item">
 +
<a href="#TS_results">Results</a>
 +
</li> 
 +
<li class="menu-item">
 +
<a href="#TS_discussion">Discussion</a>
 +
</li>
 
</div>
 
</div>
 
</html>
 
</html>
 
{{Wageningen_UR/menu}}
 
{{Wageningen_UR/menu}}
 
<html>
 
<html>
<h1><b>Toxin Scanner</b></h1>
+
<section id="TS_intro">
<p> YOUR TEXT HERE
+
<h1>Introduction</h1>
 +
<h3>iGEM and sequencing</h3>   
 +
<p>For iGEM 2016 we designed a high-throughput pipeline for the identification of novel proteins directly from raw genome sequencing data in fasta or fastq format. Given the specificity of our tool (link naar resultaat), and the importance of biobrick discovery in iGEM, we made this tool publicly available for everyone to modify and use. It can be found in this <a href=https://gitlab.com/rphdejongh/bioinformatics.git> Gitlab repository </a>.
 +
 
 +
For the purpose of the BeeT project, we use it as a cry toxin (oid) predictor, given genomes of selected bacteria. </p>
 +
<h3>Background</h3>
 +
<p>For this project we needed a toxin specific to <i>Varroa destructor</i>. There are many insecticidal protein families, but the one we were most interested in are the "crystal" (or 'cry') proteins. These are usually found on plasmids found in <i>Bacillus Thuringiensis</i> and related species. <sup><a href="#ts14" id="ref_ts14">14</a></sup>
 +
 
 +
 
 +
We know varroa-specific insecticidal activity exist in <i>Bacillus thuringiensis</i> and related species, due to a paper called: "In vitro susceptibility of <i>Varroa destructor </i>and <i>Apis mellifera</i> to native strains of <i>Bacillus thuringiensis</i>."  by Alquisira-Ramírez et al. <sup><a href="#ts2" id="ref_ts2">2</a></sup> In this paper several isolates are described that cause a mite mortality of up to 100% mite mortality. Importantly the strains also showed no insecticidal activity against bee larvae! Because of this we started several sub-projects in parallel to maximize our chances of finding a viable <i>V. destructor</i>-killer. The <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity>specificity</a>
 +
part of our project focusses on  <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#ToxinEngineering> creating </a> <i>V. destructor</i>-binding cry toxins and <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#Isolates2> finding </a> <i>V. destructor</i>-specific insecticidal proteins.
 +
<br>
 +
This tool was made in preparation for results of the latter, finding <i>V. destructor</i>-specific insecticidal proteins, which we assume to be of the cry protein family.
 +
These cry proteins are a diverse group, but are known to be highly specific for individual insects and nematodes. Cry proteins are not necessarily a group of proteins that all perform the same function in the same manner. The distinction between cry and not a cry protein is defined by a committee: <a href=http://www.lifesci.sussex.ac.uk/home/Neil_Crickmore/Bt/>Cry Protein website</a> Based on 45% sequence similarity there are over 70 groups, this high amount of diversity makes it hard to predict when something is or isn't a cry protein.
 +
 
 +
<h3>Current developments</h3>
 +
<p>There already exists publication about a tool called “Bt Toxin Scanner”<sup><a href="#ts1" id="ref_ts1">1</a></sup>, which inspired us to make our own tool. Due to the tool not being fully supported for local deployment, which is needed for high-throughput analysis, and the relatively basic analysis done by the tool, we decided to develop our own tool that is fully open-source and improves upon the Bt Toxin Scanner. Our goal with this tool is to run raw sequencing files, and deliver potential cry proteins with just the click of a button.</p>
 +
</section>
 +
 
 +
<section id="TS_methods">
 +
<h1>Methods</h1>
 +
<h3>Overview</h3><p>
 +
The entire pipeline consists of four scripts in total, one of which is entirely dedicated to analysis of the Random Forest model and not further used in the main program. The others are there to group the Machine Learning specific functions, the functions that handle known cry proteins, and the functions that handle raw sequence data. Figure 1 gives a graphical representation of the pipeline, though some modules have been left out for readability's sake.
 +
 
 +
<figure><img src="Figure not yet uploaded."><figcaption>Figure 1: A graphical representation of the pipeline showing the various methods and tools used.</figcaption></figure>
 +
</p>
 +
<h2>Stepwise description</h2>
 +
<h3>Genome assembly</h3>
 +
<p>Raw genome data directly from a Next Generation Sequencing device comes in the form of files containing many pieces of DNA called ‘reads’. These reads are usually obtained through cutting up the genome and sequencing the small bits at random after amplifying them many times. The idea is that many pieces will overlap and so the reads can be assembled like a big puzzle. Our pipeline uses a <i>De Novo</i> assembler, meaning that it can assemble the reads from scratch, without a reference to map them to. It does this using a 'graph based approach.'<sup><a href="#ts3" id="ref_ts3">3</a></sup> 
 +
Once the genome is assembled into 'contigs' we can start doing gene prediction.
 +
</p>
 +
<h3>Gene prediction</h3>
 +
<p>The pipeline does gene prediction through using Hidden Markov Models (HMM) <sup><a href="#ts4" id="ref_ts4">4</a></sup> 
 +
that have been trained to distinguish between prokaryotic coding and non-coding regions. These regions we then translate using BioPython's <sup><a href="#ts6" id="ref_ts6">6</a></sup> 
 +
translate function. These are the proteins we compare to our existing  <a href=http://www.lifesci.sussex.ac.uk/home/Neil_Crickmore/Bt/toxins2.html> Cry Protein database</a> <sup><a href="#ts5" id="ref_ts5">5</a></sup> 
 +
</p>
 +
<h3>Machine Learning </h3>
 +
<p>As cry proteins are a group that is hard to define biologically, other than having specific insecticidal activity, it is somewhat of a challenge to find them through means other than direct sequence comparisons. This is why we use three methods to find potential cry proteins: <ol>
 +
<li>A direct sequence comparison with a stringent cut-off: "Blast".</li>
 +
<li>A probabilistic approach based on the primary subdivision between cry proteins (45% sequence similarity): "HMM". <sup><a href="#ts4" id="ref_ts4">4</a></sup>  </li>
 +
<li>A machine learning approach based on protein features using many weak classifiers to give a majority vote: "RandomForest".<sup><a href="#ts7" id="ref_ts7">7</a></sup></li>
 +
</ol></p>
 +
<h3>Output </h3>
 +
<p>All three separate methods give slightly different output. In order to make an easy visual comparison a webtool was found that can easily calculate and draw custom Venn diagrams. This tool was made by the university of Gent and can be found <a href=http://bioinformatics.psb.ugent.be/webtools/Venn/>here</a>.
 
</p>
 
</p>
 +
<h2>Tools Used</h2>
 +
<h3>IDBA-UD</h3>
 +
<p>The Iterative De Bruijn Graph De Novo Assembler for highly Uneven sequencing Depth (<a href=http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/>IDBA-UD</a>) is a tool for assembling genomes in a de novo manner. This means that it can take raw sequencing data and use it to create longer stretches of overlapping DNA called contigs.
 +
</p><h3>Gene Mark</h3>
 +
<p><a href=http://exon.gatech.edu/GeneMark/> GeneMark</a> is gene prediction software, making use of HMM. This will allow us to analyze novel genomic sequences.
 +
</p><h3>Blast</h3>
 +
<p>The Basic Local Alignment Search Tool (<b>BLAST</b>) is an algorithm for comparing sequence information, such as the amino-acid (protein) or nucleic-acid (DNA/RNA) sequences. <sup><a href="#ts8" id="ref_ts8">8</a></sup> Using a database of only cry proteins, this tool will allow us to rapidly identify other proteins with highly similar sequences.
 +
</p><h3>HMMER</h3><p>
 +
The HMMER toolkit contains several programs that the pipeline uses. <sup><a href="#ts9" id="ref_ts9">9</a></sup> The program first builds many different profiles based on the multiple sequence alignment of the primary cry protein groups. This is handled by Clustal Omega. <sup><a href="#ts10" id="ref_ts10">10</a></sup>
 +
<ul>
 +
<li>HMMbuild: Builds a profile HMM from an input multiple alignment.</li>
 +
<li>HMMpress: Formats an HMM database into a binary format for hmmscan.</li>
 +
<li>HMMscan: Searches a protein sequence against a protein profile HMM database. </li>
 +
</ul>
 +
 +
</p><h3>Protein Classification</h3><p>
 +
The protein classification part of the machine learning is done by the pipeline in two steps: First the features of a protein are predicted using Biopython <sup><a href="#ts6" id="ref_ts6">6</a></sup> , and then the pre-trained model predicts a class based on those features.
 +
</p><h4>Features</h4><p>
 +
After some experimentation with the features chosen for modelling, we came up with the following list which we think will capture some structural elements and allow for good predictions. <ul>
 +
<li>Molecular weight of the protein.</li>
 +
<li>Aromaticity, ie the relative frequency of aromatic amino acids (Phenylalanine, Tryptophan and Tyrosine)</li>
 +
<li>Instability index, ie the relative frequency of instability causing dipeptides, for more information see <a href=http://peds.oxfordjournals.org/content/4/2/155>Guruprasad et al 1990.</a><sup><a href="#ts11" id="ref_ts11">11</a></sup> </li>
 +
<li>Isoelectric point, which is the pH at which a particular molecule carries no net electrical charge.</li>
 +
<li>Alpha Helix fraction, the relative frequency of amino acids that tend to be in found in the secondary structure known as alpha helices. </li>
 +
<li>Turn fraction, the relative frequency of amino acids that tend to be in found in the secondary structure known as turns.</li>
 +
<li>Beta Sheet fraction, the relative frequency of amino acids that tend to be in found in the secondary structure known as beta sheets.</li>
 +
</ul>
 +
</p><h3>Scikit-Learn</h3><p>
 +
We made the classification model using scikit-learn or sklearn.<sup><a href="#ts12" id="ref_ts12">12</a></sup> The class used for this model is called the <a href=http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>RandomForestClassifier</a> which works by fitting a number of decision trees, in our case 2000, on various sub-samples of the dataset and averaging to improve accuracy and control overfitting. We assess the predictive accuracy of our model by doing a stratified 10-fold crossvalidation technique from the <a href=http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html>StratifiedKFold</a> class. These 10 crossvalidation runs were assessed with the <a href=http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html>ROC-curves</a> method.
 +
 
</section>
 
</section>
 +
 +
<section id="TS_results">
 +
<h1>Results / 'validation'</h1>
 +
<h2>PRJEB5931</h2><p>
 +
We tested the pipeline on a genome sample from a study with accession number: <a href=http://www.ebi.ac.uk/ena/data/view/PRJEB5931> PRJEB5931</a>. <sup><a href="#ts13" id="ref_ts13">13</a></sup> This genome was found after a co-evolution experiment and  a bt with known nematicidal cry proteins present.
 +
 +
</p><h2>Key result</h2><p>
 +
The easiest way to visualize what the three seperate components of the pipeline have done is to compare their results in a <a href=http://bioinformatics.psb.ugent.be/webtools/Venn/>venn diagram</a> (with accompanying lists).
 +
 +
 +
<figure><img src="Figure not yet uploaded."><figcaption>Figure 2: A venn diagram showing the overlap of the output of the various methods used to  predict whether or not certain genes from the genome were cry proteins or not.</figcaption></figure>
 +
 +
 +
</section><section id="TS_discussion">
 +
<h1>Discussion</h1>
 +
<p>It seems to work, but the definition of cry protein is vague, so the ML could be further optimized based on a more rigorous examination of what is and what is not a cry protein.</p>
 +
</section><h1>References</h1>
 +
<a id="ts1" href=http://aem.asm.org/content/78/14/4795.short>1.</a> Ye, W., Zhu, L., Liu, Y., Crickmore, N., Peng, D., Ruan, L., & Sun, M. (2012). Mining new crystal protein genes from Bacillus thuringiensis on the basis of mixed plasmid-enriched genome sequencing and a computational pipeline. Applied and environmental microbiology, 78(14), 4795-4801.
 +
<a href="#ts1" title="Jump back to footnote 1 in the text.">↩</a>
 +
<br><br>
 +
<a id="ts2" href=http://link.springer.com/article/10.1007/s13592-014-0288-z>2.</a> Alquisira-Ramírez, E. V., Paredes-Gonzalez, J. R., Hernández-Velázquez, V. M., Ramírez-Trujillo, J. A., & Peña-Chora, G. (2014). In vitro susceptibility of Varroa destructor and Apis mellifera to native strains of Bacillus thuringiensis. Apidologie, 45(6), 707-718.
 +
<a href="#ts2" title="Jump back to footnote 2 in the text.">↩</a>
 +
<br><br>
 +
<a id="ts3" href= http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html>3.</a> Compeau, P. E., Pevzner, P. A., & Tesler, G. (2011). How to apply de Bruijn graphs to genome assembly. Nature biotechnology, 29(11), 987-991.
 +
<a href="#ts3" title="Jump back to footnote 3 in the text.">↩</a>
 +
<br><br>
 +
<a id="ts4" href=http://ieeexplore.ieee.org/document/1165342/>4.</a> Rabiner, L., & Juang, B. (1986). An introduction to hidden Markov models. ieee assp magazine, 3(1), 4-16.
 +
<a href="#ts4" title="Jump back to footnote 4 in the text.">↩</a>
 +
<br><br>
 +
<a id="ts5" href=http://www.btnomenclature.info/  >5.</a>  Crickmore, N., Baum, J., Bravo, A., Lereclus, D., Narva, K., Sampson, K., Schnepf, E., Sun, M. and Zeigler, D.R. "Bacillus thuringiensis toxin nomenclature" (2016)
 +
http://www.btnomenclature.info/
 +
<a href="#ts5" title="Jump back to footnote 5 in the text.">↩</a>
 +
<br><br>
 +
<a id="ts6" href=http://dx.doi.org/10.1093/bioinformatics/btp163 >6.</a>  Cock PA, Antao T, Chang JT, Bradman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423
 +
<a href="#ts6" title="Jump back to footnote 6 in the text.">↩</a>
 +
<br><br>
 +
<a id="ts7" href=ftp://131.252.97.79/Transfer/Treg/WFRE_Articles/Liaw_02_Classification%20and%20regression%20by%20randomForest.pdf >7.</a>  Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.
 +
<a href="#ts7" title="Jump back to footnote 7 in the text.">↩</a>
 +
<br><br>
 +
<a id="ts8" href= >@.</a>  Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.
 +
<a href="#ts8" title="Jump back to footnote 8 in the text.">↩</a>
 +
<br><br>
 +
<a id="ts9" href=www.hmmr.org >9.</a>  Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755-763.
 +
<a href="#ts9" title="Jump back to footnote 9 in the text.">↩</a>
 +
<br><br>
 +
<a id="ts10" href=http://link.springer.com/protocol/10.1007/978-1-62703-646-7_6  >10.</a>  Sievers, F., & Higgins, D. G. (2014). Clustal Omega, accurate alignment of very large numbers of sequences. Multiple sequence alignment methods, 105-116.
 +
<a href="#ts10 " title="Jump back to footnote 10  in the text.">↩</a>
 +
<br><br>
 +
<a id="ts11" href=http://peds.oxfordjournals.org/content/4/2/155  >11.</a>  Guruprasad, K., Reddy, B. B., & Pandit, M. W. (1990). Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein engineering, 4(2), 155-161.
 +
<a href="#ts11 " title="Jump back to footnote 11  in the text.">↩</a>
 +
<br><br>
 +
<a id="ts12" href=http://scikit-learn.org/stable/  >12.</a>  Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.
 +
<a href="#ts12 " title="Jump back to footnote 12  in the text.">↩</a>
 +
<br><br>
 +
<a id="ts13" href= http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002169 >13.</a>  Masri, L., Branca, A., Sheppard, A. E., Papkou, A., Laehnemann, D., Guenther, P. S., ... & Brzuszkiewicz, E. (2015). Host–pathogen coevolution: the selective advantage of Bacillus thuringiensis virulence and its cry toxin genes. PLoS Biol, 13(6), e1002169.
 +
<a href="#ts13 " title="Jump back to footnote 13  in the text.">↩</a>
 +
<br><br>
 +
<a id="ts14" href=  >14.</a>de Maagd, R. A., Bravo, A., & Crickmore, N. (2001). How Bacillus thuringiensis has evolved specific toxins to colonize the insect world. TRENDS in Genetics, 17(4), 193-199.
 +
<a href="#ts14 " title="Jump back to footnote 14  in the text.">↩</a><br><br>
 +
<a id="ts@" href=  >@.</a> 
 +
<a href="#ts@ " title="Jump back to footnote @  in the text.">↩</a>
 +
<br><br>
 +
<a id="ts@" href=  >@.</a> 
 +
<a href="#ts@ " title="Jump back to footnote @  in the text.">↩</a>
 +
<br><br>
 +
<a id="ts@" href=  >@.</a> 
 +
<a href="#ts@ " title="Jump back to footnote @  in the text.">↩</a>
 +
<br><br>
 +
<a id="ts@" href=  >@.</a> 
 +
<a href="#ts@ " title="Jump back to footnote @  in the text.">↩</a>
 +
<br><br>
 +
<a id="ts@" href=  >@.</a> 
 +
<a href="#ts@ " title="Jump back to footnote @  in the text.">↩</a>
 +
 +
 +
Stuff I'm not sure what to do with:
 +
<h3>Side features</h3><p>
 +
has overwrite protection
 +
can be multi threaded
 +
creates its own directory structure
 +
Has tools to do detailed analysis of the random forest/machine learning model created(plotting decision boundaries, interpreting specific predictions)
 +
The tool can also be modified to search and compare other sets of genes against raw genome sequences. This would be done by providing different input and modifying some of the internal code handling filenames. </p>
 +
 +
<h4>Exploring cry protein features </h4>
 +
<p>Creating sub divisions based on clustering, PCA and kmeans clustering show that the only feature where  to be made on the simple features.  Other than the proteins clustering by protein length/molecular weight, there do not seem to be clear clusters to be made based on the basic features we chose for identifying cry toxins. </p>
 +
  
 
</html>
 
</html>
 
{{Wageningen_UR/footer}}
 
{{Wageningen_UR/footer}}

Revision as of 16:06, 10 October 2016

Wageningen UR iGEM 2016

 

Introduction

iGEM and sequencing

For iGEM 2016 we designed a high-throughput pipeline for the identification of novel proteins directly from raw genome sequencing data in fasta or fastq format. Given the specificity of our tool (link naar resultaat), and the importance of biobrick discovery in iGEM, we made this tool publicly available for everyone to modify and use. It can be found in this Gitlab repository . For the purpose of the BeeT project, we use it as a cry toxin (oid) predictor, given genomes of selected bacteria.

Background

For this project we needed a toxin specific to Varroa destructor. There are many insecticidal protein families, but the one we were most interested in are the "crystal" (or 'cry') proteins. These are usually found on plasmids found in Bacillus Thuringiensis and related species. 14 We know varroa-specific insecticidal activity exist in Bacillus thuringiensis and related species, due to a paper called: "In vitro susceptibility of Varroa destructor and Apis mellifera to native strains of Bacillus thuringiensis." by Alquisira-Ramírez et al. 2 In this paper several isolates are described that cause a mite mortality of up to 100% mite mortality. Importantly the strains also showed no insecticidal activity against bee larvae! Because of this we started several sub-projects in parallel to maximize our chances of finding a viable V. destructor-killer. The specificity part of our project focusses on creating V. destructor-binding cry toxins and finding V. destructor-specific insecticidal proteins.
This tool was made in preparation for results of the latter, finding V. destructor-specific insecticidal proteins, which we assume to be of the cry protein family. These cry proteins are a diverse group, but are known to be highly specific for individual insects and nematodes. Cry proteins are not necessarily a group of proteins that all perform the same function in the same manner. The distinction between cry and not a cry protein is defined by a committee: Cry Protein website Based on 45% sequence similarity there are over 70 groups, this high amount of diversity makes it hard to predict when something is or isn't a cry protein.

Current developments

There already exists publication about a tool called “Bt Toxin Scanner”1, which inspired us to make our own tool. Due to the tool not being fully supported for local deployment, which is needed for high-throughput analysis, and the relatively basic analysis done by the tool, we decided to develop our own tool that is fully open-source and improves upon the Bt Toxin Scanner. Our goal with this tool is to run raw sequencing files, and deliver potential cry proteins with just the click of a button.

Methods

Overview

The entire pipeline consists of four scripts in total, one of which is entirely dedicated to analysis of the Random Forest model and not further used in the main program. The others are there to group the Machine Learning specific functions, the functions that handle known cry proteins, and the functions that handle raw sequence data. Figure 1 gives a graphical representation of the pipeline, though some modules have been left out for readability's sake.

Figure 1: A graphical representation of the pipeline showing the various methods and tools used.

Stepwise description

Genome assembly

Raw genome data directly from a Next Generation Sequencing device comes in the form of files containing many pieces of DNA called ‘reads’. These reads are usually obtained through cutting up the genome and sequencing the small bits at random after amplifying them many times. The idea is that many pieces will overlap and so the reads can be assembled like a big puzzle. Our pipeline uses a De Novo assembler, meaning that it can assemble the reads from scratch, without a reference to map them to. It does this using a 'graph based approach.'3 Once the genome is assembled into 'contigs' we can start doing gene prediction.

Gene prediction

The pipeline does gene prediction through using Hidden Markov Models (HMM) 4 that have been trained to distinguish between prokaryotic coding and non-coding regions. These regions we then translate using BioPython's 6 translate function. These are the proteins we compare to our existing Cry Protein database 5

Machine Learning

As cry proteins are a group that is hard to define biologically, other than having specific insecticidal activity, it is somewhat of a challenge to find them through means other than direct sequence comparisons. This is why we use three methods to find potential cry proteins:

  1. A direct sequence comparison with a stringent cut-off: "Blast".
  2. A probabilistic approach based on the primary subdivision between cry proteins (45% sequence similarity): "HMM". 4
  3. A machine learning approach based on protein features using many weak classifiers to give a majority vote: "RandomForest".7

Output

All three separate methods give slightly different output. In order to make an easy visual comparison a webtool was found that can easily calculate and draw custom Venn diagrams. This tool was made by the university of Gent and can be found here.

Tools Used

IDBA-UD

The Iterative De Bruijn Graph De Novo Assembler for highly Uneven sequencing Depth (IDBA-UD) is a tool for assembling genomes in a de novo manner. This means that it can take raw sequencing data and use it to create longer stretches of overlapping DNA called contigs.

Gene Mark

GeneMark is gene prediction software, making use of HMM. This will allow us to analyze novel genomic sequences.

Blast

The Basic Local Alignment Search Tool (BLAST) is an algorithm for comparing sequence information, such as the amino-acid (protein) or nucleic-acid (DNA/RNA) sequences. 8 Using a database of only cry proteins, this tool will allow us to rapidly identify other proteins with highly similar sequences.

HMMER

The HMMER toolkit contains several programs that the pipeline uses. 9 The program first builds many different profiles based on the multiple sequence alignment of the primary cry protein groups. This is handled by Clustal Omega. 10

  • HMMbuild: Builds a profile HMM from an input multiple alignment.
  • HMMpress: Formats an HMM database into a binary format for hmmscan.
  • HMMscan: Searches a protein sequence against a protein profile HMM database.

Protein Classification

The protein classification part of the machine learning is done by the pipeline in two steps: First the features of a protein are predicted using Biopython 6 , and then the pre-trained model predicts a class based on those features.

Features

After some experimentation with the features chosen for modelling, we came up with the following list which we think will capture some structural elements and allow for good predictions.

  • Molecular weight of the protein.
  • Aromaticity, ie the relative frequency of aromatic amino acids (Phenylalanine, Tryptophan and Tyrosine)
  • Instability index, ie the relative frequency of instability causing dipeptides, for more information see Guruprasad et al 1990.11
  • Isoelectric point, which is the pH at which a particular molecule carries no net electrical charge.
  • Alpha Helix fraction, the relative frequency of amino acids that tend to be in found in the secondary structure known as alpha helices.
  • Turn fraction, the relative frequency of amino acids that tend to be in found in the secondary structure known as turns.
  • Beta Sheet fraction, the relative frequency of amino acids that tend to be in found in the secondary structure known as beta sheets.

Scikit-Learn

We made the classification model using scikit-learn or sklearn.12 The class used for this model is called the RandomForestClassifier which works by fitting a number of decision trees, in our case 2000, on various sub-samples of the dataset and averaging to improve accuracy and control overfitting. We assess the predictive accuracy of our model by doing a stratified 10-fold crossvalidation technique from the StratifiedKFold class. These 10 crossvalidation runs were assessed with the ROC-curves method.

Results / 'validation'

PRJEB5931

We tested the pipeline on a genome sample from a study with accession number: PRJEB5931. 13 This genome was found after a co-evolution experiment and a bt with known nematicidal cry proteins present.

Key result

The easiest way to visualize what the three seperate components of the pipeline have done is to compare their results in a venn diagram (with accompanying lists).

Figure 2: A venn diagram showing the overlap of the output of the various methods used to predict whether or not certain genes from the genome were cry proteins or not.

Discussion

It seems to work, but the definition of cry protein is vague, so the ML could be further optimized based on a more rigorous examination of what is and what is not a cry protein.

References

1. Ye, W., Zhu, L., Liu, Y., Crickmore, N., Peng, D., Ruan, L., & Sun, M. (2012). Mining new crystal protein genes from Bacillus thuringiensis on the basis of mixed plasmid-enriched genome sequencing and a computational pipeline. Applied and environmental microbiology, 78(14), 4795-4801.

2. Alquisira-Ramírez, E. V., Paredes-Gonzalez, J. R., Hernández-Velázquez, V. M., Ramírez-Trujillo, J. A., & Peña-Chora, G. (2014). In vitro susceptibility of Varroa destructor and Apis mellifera to native strains of Bacillus thuringiensis. Apidologie, 45(6), 707-718.

3. Compeau, P. E., Pevzner, P. A., & Tesler, G. (2011). How to apply de Bruijn graphs to genome assembly. Nature biotechnology, 29(11), 987-991.

4. Rabiner, L., & Juang, B. (1986). An introduction to hidden Markov models. ieee assp magazine, 3(1), 4-16.

5. Crickmore, N., Baum, J., Bravo, A., Lereclus, D., Narva, K., Sampson, K., Schnepf, E., Sun, M. and Zeigler, D.R. "Bacillus thuringiensis toxin nomenclature" (2016) http://www.btnomenclature.info/

6. Cock PA, Antao T, Chang JT, Bradman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423

7. Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.

@. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.

9. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755-763.

10. Sievers, F., & Higgins, D. G. (2014). Clustal Omega, accurate alignment of very large numbers of sequences. Multiple sequence alignment methods, 105-116.

11. Guruprasad, K., Reddy, B. B., & Pandit, M. W. (1990). Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein engineering, 4(2), 155-161.

12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.

13. Masri, L., Branca, A., Sheppard, A. E., Papkou, A., Laehnemann, D., Guenther, P. S., ... & Brzuszkiewicz, E. (2015). Host–pathogen coevolution: the selective advantage of Bacillus thuringiensis virulence and its cry toxin genes. PLoS Biol, 13(6), e1002169.

14.de Maagd, R. A., Bravo, A., & Crickmore, N. (2001). How Bacillus thuringiensis has evolved specific toxins to colonize the insect world. TRENDS in Genetics, 17(4), 193-199.

@.

@.

@.

@.

@. Stuff I'm not sure what to do with:

Side features

has overwrite protection can be multi threaded creates its own directory structure Has tools to do detailed analysis of the random forest/machine learning model created(plotting decision boundaries, interpreting specific predictions) The tool can also be modified to search and compare other sets of genes against raw genome sequences. This would be done by providing different input and modifying some of the internal code handling filenames.

Exploring cry protein features

Creating sub divisions based on clustering, PCA and kmeans clustering show that the only feature where to be made on the simple features. Other than the proteins clustering by protein length/molecular weight, there do not seem to be clear clusters to be made based on the basic features we chose for identifying cry toxins.