Difference between revisions of "Team:Wageningen UR/Software"

Line 22: Line 22:
 
<section id="TS_intro">
 
<section id="TS_intro">
 
<h1 style="text-align:center;">Toxin Scanner</h1>
 
<h1 style="text-align:center;">Toxin Scanner</h1>
<h3>iGEM and sequencing</h3>     
+
<h3>BioBrick discovery</h3>     
<p>For iGEM 2016 we designed a high-throughput pipeline for the identification of novel proteins directly from raw genome sequencing data in fasta or fastq format. Given the specificity of our tool and the importance of biobrick discovery in iGEM, we made this tool publicly available for everyone to modify and use. It can be found in this <a href=https://gitlab.com/rphdejongh/bioinformatics.git style="padding-right:0px;"> Gitlab repository</a>. For the purpose of the BeeT project, we use it as a cry toxin (oid) predictor, given genomes of selected bacteria. With some modifications it has the potential to be a more broadly applicable tool for finding specific genes in newly sequenced genomes. </p>
+
<p>For iGEM 2016 we designed a high-throughput pipeline for the identification of novel proteins directly from raw genome sequencing data. Given the specificity of our tool and the importance of biobrick discovery in iGEM, we made it publicly available for everyone to modify and use. The tool can be found in this <a href=https://gitlab.com/rphdejongh/bioinformatics.git style="padding-right:0px;"> Gitlab repository </a>.  
<h3>Background</h3>
+
<br/>
<p>For this project we need a toxin specific to <i>Varroa destructor</i>. There are many insecticidal protein families, but the one we were most interested in are the "crystal" (or "cry") proteins. These are usually found on plasmids found in <i>Bacillus Thuringiensis</i> and related species. <sup><a href="#ts14" id="ref_ts14">14</a></sup>  
+
For the purpose of the BeeT project, we use it as a cry toxin predictor, given genomes of selected bacteria. </p>
 
+
<h3>Toxin Specificity</h3>
 
+
<p>For this project we need a toxin that specifically targets <i>Varroa destructor</i>. There are many insecticidal protein families, but the group we are most interested in is the crystal (cry) proteins. These are usually found on plasmids from <i>Bacillus thuringiensis</i> and related species. <sup><a href="#ts14" id="ref_ts14">14</a></sup> </p>
We know <i>Varroa</i>-specific insecticidal activity exists in <i>Bacillus thuringiensis</i> and related species, due to a paper called: "In vitro susceptibility of <i>Varroa destructor </i>and <i>Apis mellifera</i> to native strains of <i>Bacillus thuringiensis</i>."  by Alquisira-Ramírez et al. <sup><a href="#ts2" id="ref_ts2">2</a></sup> In this paper several isolates are described that cause a mite mortality of up to 100%. Importantly the strains also showed no insecticidal activity against bee larvae. Because of this we started several sub-projects in parallel to maximize our chances of finding a viable <i>V. destructor</i>-killer. The <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity>specificity</a>  
+
<p>We know <i>Varroa</i>-specific insecticidal activity exists in <i>Bacillus thuringiensis</i> and related species, as shown in: "In vitro susceptibility of <i>Varroa destructor </i>and <i>Apis mellifera</i> to native strains of <i>Bacillus thuringiensis</i>."  by Alquisira-Ramírez et al. <sup><a href="#ts2" id="ref_ts2">2</a></sup> In this paper, several isolates are described that cause a mite mortality of up to 100%. Importantly, the strains also showed no insecticidal activity against bee larvae. Because of this we started several sub-projects in parallel to maximize our chances of finding a viable <i>V. destructor</i>-killer. The <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity>specificity</a>  
  part of our project focuses on  <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#ToxinEngineering> creating </a> <i>V. destructor</i>-binding cry toxins and <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#Isolates2> finding </a> <i>V. destructor</i>-specific insecticidal proteins.  
+
  part of our project focuses on  <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#ToxinEngineering> creating </a> <i>V. destructor</i>-binding cry toxins and <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#Isolates2> finding </a> <i>V. destructor</i>-specific insecticidal proteins. </p>
<br>
+
<p>There already exists a publication about a tool called “Bt Toxin Scanner”<sup><a href="#ts1" id="ref_ts1">1</a></sup>. This tool does not fully support local deployment, which is needed for high-throughput analysis. Also, because of the relatively basic analysis done by the tool, we decided to develop our own tool that is fully open-source and improves upon the analysis techniques used in Bt Toxin Scanner. Our goal with this tool is to run raw sequencing files, and deliver potential cry proteins with just the click of a button.</p>
This tool was made in preparation for results of the latter, finding <i>V. destructor</i>-specific insecticidal proteins, which we assume to be of the cry protein family.  
+
<p>This tool was made in preparation for the results of the <a href=https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#Isolates2><i>V. destructor</i> isolates</a> part of the project. We assume these <i>V. destructor</i>-specific insecticidal proteins to be of the cry protein family, or at the very least quite similar to its members.<br>
These cry proteins are a diverse group, but are known to be highly specific for individual insects and nematodes. Cry proteins are not necessarily a group of proteins that all perform the same function in the same manner. The distinction between cry and non-cry proteins is defined by a committee: <a href=http://www.lifesci.sussex.ac.uk/home/Neil_Crickmore/Bt/>Cry Protein website</a> Based on 45% sequence similarity there are over 70 groups, this high amount of diversity makes it hard to predict when something is or isn't a cry protein.
+
These cry proteins are a diverse group, but are known to be highly specific for individual insects and nematodes. Cry proteins are not necessarily a group of proteins that all perform the same function in the same manner. The distinction between cry and non-cry proteins is defined by a committee (see the <a href=http://www.lifesci.sussex.ac.uk/home/Neil_Crickmore/Bt/>Cry Protein website</a>). Grouping at 45% sequence similarity yields over 70 groups. This high degree of diversity makes it hard to predict whether something is a cry protein. </p>
 
+
<h3>Current developments</h3>
+
<p>There already exists a publication about a tool called “Bt Toxin Scanner”<sup><a href="#ts1" id="ref_ts1">1</a></sup>. This tool does not fully support local deployment, which is needed for high-throughput analysis, and the relatively basic analysis done by the tool. Due to these reasons we decided to develop our own tool that is fully open-source and improves upon the Bt Toxin Scanner. Our goal with this tool is to run raw sequencing files, and deliver potential cry proteins with just the click of a button.</p>
+
 
</section>
 
</section>
 
 
<section id="TS_results">
 
<section id="TS_results">
<h1>Pipeline validation</h1>
+
<h1>Testing the tool</h1>
<h2>PRJEB5931</h2><p>
+
<p> We tested the tool on a genome sample from a study with accession number <a href=http://www.ebi.ac.uk/ena/data/view/PRJEB5931> PRJEB5931</a>. <sup><a href="#ts13" id="ref_ts13">13</a></sup> This genome was obtained after a co-evolution experiment and comes from a <i>Bacillus thuringiensis</i> population with known nematicidal cry proteins. 
We tested the pipeline on a genome sample from a study with accession number <a href=http://www.ebi.ac.uk/ena/data/view/PRJEB5931> PRJEB5931</a>. <sup><a href="#ts13" id="ref_ts13">13</a></sup> This genome was found after a co-evolution experiment, and a <i>Bacillus thuringiensis</i> with known nematicidal cry proteins present.  
+
 
<br>
 
<br>
<a href=http://www.ebi.ac.uk/ena/data/view/ERX463573>ERX463573</a> is the accession code of the experiment from which these raw read files came. The description reads: "Population of Bacillus thuringiensis coming from 5 strains experimentally coevolved with Caenorhabditis as host" and the Name: "Coevolution G12 Pop4". From the study we know to expect at the very least the following two proteins:
+
<a href=http://www.ebi.ac.uk/ena/data/view/ERX463573>ERX463573</a> is the accession code of the experiment from which these raw read files came. That experiment studied a population of <i>Bacillus thuringiensis</i> that was coevolved with nematodes of the genus <i>Caenorhabditis</i> as host.
 +
 
 +
From the study we know to expect at the very least the following two proteins:
 
<a href=https://www.ncbi.nlm.nih.gov/nucleotide/47500285> Cry35Aa4 </a>
 
<a href=https://www.ncbi.nlm.nih.gov/nucleotide/47500285> Cry35Aa4 </a>
 
and  
 
and  
 
<a href=https://www.ncbi.nlm.nih.gov/nucleotide/2724454>Cry21Aa2</a>. <br>
 
<a href=https://www.ncbi.nlm.nih.gov/nucleotide/2724454>Cry21Aa2</a>. <br>
According to the cry protein toxin list it is known that Cry35Aa4 is a binary toxin with <a href=https://www.ncbi.nlm.nih.gov/nucleotide/47500295> Cry34Aa4 </a> and as such we may expect to find this protein, or one like it, as well.
+
According to the cry protein toxin list it is known that Cry35Aa4 is a binary toxin with <a href=https://www.ncbi.nlm.nih.gov/nucleotide/47500295> Cry34Aa4 </a>, and as such we may expect to find this protein, or one like it, as well.
  
  
</p><h2>Key result</h2><p>
+
</p><h2>Visualizing the result</h2><p>
The easiest way to visualize the results from the three separate components of the pipeline is to use a <a href=http://bioinformatics.psb.ugent.be/webtools/Venn/>Venn diagram</a>.
+
The easiest way to visualize the results from the three separate components of the tool is to use a <a href=http://bioinformatics.psb.ugent.be/webtools/Venn/>Venn diagram</a>.
  
<figure><img src=https://static.igem.org/mediawiki/2016/2/2c/T--Wageningen_UR--Venn-Diagram_TS.png><figcaption>Figure 1: A Venn diagram showing the overlap of the output of the various methods used to predict whether or not certain genes from the genome were cry proteins or not.</figcaption></figure>
+
<figure><img src=https://static.igem.org/mediawiki/2016/2/2c/T--Wageningen_UR--Venn-Diagram_TS.png><figcaption>Figure 1: A Venn diagram showing the overlap of the output of the various methods used to predict whether or not certain genes from the genome are cry proteins or not.</figcaption></figure>
  
<p>As Blast had only 5 results it is easier to examine this method in detail. Two of the five were proteins we were expecting from the paper: Cry35Aa4 and Cry21Aa2. These were found to be:</p> <ul>
+
<p>As BLAST had only 5 results it is easier to examine this method in detail. Two of the five were indeed proteins we expected from the paper: Cry35Aa4 and Cry21Aa2. Both of these were also picked up by the Hidden Markov Model method of cry protein detection as shown in Figure 1, but not by the RandomForest method. <br>
<li>Gene_5932, which had 100.00% identity with: Cry35Aa4 </li>
+
Next we found a protein that matched very well with the Cry34Aa group, which is complementary to Cry35Aa4.<br> 
<li>Gene_5527, which had 96.04% identity with: Cry21Aa2</li>
+
The two other hits were a match to Cry38Aa1, which has no known insecticidal target but is highly similar to proteins that do (Cry15Aa1, Cry23Aa1, and Cry33Aa1), <sup><a href="#ts17" id="ref_ts17">17</a></sup><br>
</ul>
+
and "Gene_5518", which was 85.87% identical to Cry14Ab1. Cry14Ab1 is only mentioned in a patent by Sampson et al. (2012).<sup><a href="#ts18" id="ref_ts18">18</a></sup> This hit is quite interesting because it might be a completely new protein or a nematode-specific variant of Cry14Ab1.
<p>Based on the fact that Cry35Aa4 is a binary protein we expected to find its complementary protein as well, which we did:</p> <ul>
+
<li>Gene_5931, which had 100.00% identity with all four proteins in the Cry34Aa group.</li>
+
</ul>
+
<p>Two other proteins appeared with a good overlap and a good e-value, namely: <ul>
+
<li>Gene_5934, which had 98.39% identity with: Cry38Aa1, which has no known insecticidal target, but is highly similar to proteins that do: Cry15Aa1, Cry23Aa1, and Cry33Aa1.  <sup><a href="#ts17" id="ref_ts17">17</a></sup></li>
+
<li>Gene_5518, which had 85.87% identity with: Cry14Ab1. Which is only mentioned in a patent by Sampson et al. 2012. <sup><a href="#ts18" id="ref_ts18">18</a></sup> </li>
+
</ul>
+
<p>Both of these were also picked up by the Hidden Markov Model method of cry protein detection as shown in figure 1, but not by the RandomForest method. </p>
+
  
 
</section>
 
</section>
 
<section id="TS_methods">
 
<section id="TS_methods">
<h1>Methods</h1>
+
<h1>Tool overview</h1>
<h3>Overview</h3><p>
+
<p>
 
We use a combination of existing tools to come to the prediction of novel cry proteins. The entire pipeline consists of four scripts in total, one of which is entirely dedicated to analysis of the Random Forest model and not further used in the main program. The others are there to group the Machine Learning specific functions, the functions that handle known cry proteins, and the functions that handle raw sequence data. Figure 2 gives a graphical representation of the pipeline, though some modules have been left out for the sake of readability. Here, we shall go through each part of the process in a step-by-step manner.</p>
 
We use a combination of existing tools to come to the prediction of novel cry proteins. The entire pipeline consists of four scripts in total, one of which is entirely dedicated to analysis of the Random Forest model and not further used in the main program. The others are there to group the Machine Learning specific functions, the functions that handle known cry proteins, and the functions that handle raw sequence data. Figure 2 gives a graphical representation of the pipeline, though some modules have been left out for the sake of readability. Here, we shall go through each part of the process in a step-by-step manner.</p>
 
<map name="pipeline-map" id="pipeline-map">
 
<map name="pipeline-map" id="pipeline-map">
Line 82: Line 71:
 
</map>
 
</map>
 
<figure><img src=https://static.igem.org/mediawiki/2016/0/08/T--Wageningen_UR--pipeline_overview_TS.png usemap="#pipeline-map"><figcaption>Figure 2: A graphical representation of the pipeline showing the various methods and tools used. All the pink diamonds are clickable and will take you to the respective tool's homepage. The known cry proteins box will take you to the cry protein database. </figcaption></figure>
 
<figure><img src=https://static.igem.org/mediawiki/2016/0/08/T--Wageningen_UR--pipeline_overview_TS.png usemap="#pipeline-map"><figcaption>Figure 2: A graphical representation of the pipeline showing the various methods and tools used. All the pink diamonds are clickable and will take you to the respective tool's homepage. The known cry proteins box will take you to the cry protein database. </figcaption></figure>
 +
  
 
<h2>Software description</h2>
 
<h2>Software description</h2>
 +
<p onclick="javascript:ShowHide('HiddenDiv1')" style="border: 2px solid gray;">Click here for a highly detailed overview of how the pipeline works.</p>
 +
<div class="mid" id="HiddenDiv1" style="display: none; border: 2px solid gray;">
 
<h3>Genome assembly</h3>
 
<h3>Genome assembly</h3>
 
<p>Raw genome data directly from a Next Generation Sequencing device comes in the form of files containing many pieces of DNA called ‘reads’. These reads are usually obtained through cutting up the genome and sequencing the small bits at random after amplifying them many times. The idea is that many reads will overlap and can be assembled like a big puzzle. Our pipeline uses a <i>de novo</i> assembler, meaning that it can assemble the reads from scratch, without a reference genome to map them to. This is done by a  
 
<p>Raw genome data directly from a Next Generation Sequencing device comes in the form of files containing many pieces of DNA called ‘reads’. These reads are usually obtained through cutting up the genome and sequencing the small bits at random after amplifying them many times. The idea is that many reads will overlap and can be assembled like a big puzzle. Our pipeline uses a <i>de novo</i> assembler, meaning that it can assemble the reads from scratch, without a reference genome to map them to. This is done by a  
Line 152: Line 144:
 
<h3>Venn-Diagrams</h3>
 
<h3>Venn-Diagrams</h3>
 
<p>All three separate methods give a different output. In order to make an easy visual comparison, we make use of a web tool that can easily calculate and draw custom Venn-diagrams. This tool was made by the university of Gent and can be found <a href=http://bioinformatics.psb.ugent.be/webtools/Venn/>here</a>.  
 
<p>All three separate methods give a different output. In order to make an easy visual comparison, we make use of a web tool that can easily calculate and draw custom Venn-diagrams. This tool was made by the university of Gent and can be found <a href=http://bioinformatics.psb.ugent.be/webtools/Venn/>here</a>.  
 +
 +
</div>
 
</section>
 
</section>
  

Revision as of 13:51, 14 October 2016

Wageningen UR iGEM 2016

 

Toxin Scanner

BioBrick discovery

For iGEM 2016 we designed a high-throughput pipeline for the identification of novel proteins directly from raw genome sequencing data. Given the specificity of our tool and the importance of BioBrick discovery in iGEM, we made it publicly available for everyone to modify and use. The tool can be found in this GitLab repository.
For the purpose of the BeeT project, we use it as a cry toxin predictor, given genomes of selected bacteria.

Toxin Specificity

For this project we need a toxin that specifically targets Varroa destructor. There are many insecticidal protein families, but the group we are most interested in is the crystal (cry) proteins. These are usually found on plasmids from Bacillus thuringiensis and related species. 14

We know Varroa-specific insecticidal activity exists in Bacillus thuringiensis and related species, as shown in: "In vitro susceptibility of Varroa destructor and Apis mellifera to native strains of Bacillus thuringiensis." by Alquisira-Ramírez et al. 2 In this paper, several isolates are described that cause a mite mortality of up to 100%. Importantly, the strains also showed no insecticidal activity against bee larvae. Because of this we started several sub-projects in parallel to maximize our chances of finding a viable V. destructor-killer. The specificity part of our project focuses on creating V. destructor-binding cry toxins and finding V. destructor-specific insecticidal proteins.

There already exists a publication about a tool called “Bt Toxin Scanner”1. This tool does not fully support local deployment, which is needed for high-throughput analysis. Also, because of the relatively basic analysis done by the tool, we decided to develop our own tool that is fully open-source and improves upon the analysis techniques used in Bt Toxin Scanner. Our goal with this tool is to run raw sequencing files, and deliver potential cry proteins with just the click of a button.

This tool was made in preparation for the results of the V. destructor isolates part of the project. We assume these V. destructor-specific insecticidal proteins to be of the cry protein family, or at the very least quite similar to its members.
These cry proteins are a diverse group, but are known to be highly specific for individual insects and nematodes. Cry proteins are not necessarily a group of proteins that all perform the same function in the same manner. The distinction between cry and non-cry proteins is defined by a committee (see the Cry Protein website). Grouping at 45% sequence similarity yields over 70 groups. This high degree of diversity makes it hard to predict whether something is a cry protein.
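
To make the 45% figure concrete, the sketch below computes percent identity between two already-aligned protein sequences and checks it against that threshold. It is only an illustration: the function names, the choice to ignore gapped columns, and the pairwise test itself are our assumptions, and the committee's actual grouping procedure is not reproduced here.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the gap-free columns of a pairwise alignment.

    Both inputs must be rows of the same alignment (equal length, '-' for gaps).
    Ignoring gapped columns is an assumption; other definitions exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def same_group(aligned_a, aligned_b, threshold=45.0):
    """Hypothetical helper: True if the pair reaches the 45% threshold."""
    return percent_identity(aligned_a, aligned_b) >= threshold


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    print(percent_identity("MKT-AL", "MKSGAL"))
    print(same_group("AAAAAAAAAA", "AAAAGGGGGG"))
```

A real comparison would first align the full-length proteins, for example with a multiple sequence aligner, before applying a function like this.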

Testing the tool

We tested the tool on a genome sample from a study with accession number PRJEB5931. 13 This genome was obtained after a co-evolution experiment and comes from a Bacillus thuringiensis population with known nematicidal cry proteins.
ERX463573 is the accession code of the experiment from which these raw read files came. That experiment studied a population of Bacillus thuringiensis that was coevolved with nematodes of the genus Caenorhabditis as host. From the study we know to expect at the very least the following two proteins: Cry35Aa4 and Cry21Aa2.
According to the cry protein toxin list, Cry35Aa4 is a binary toxin with Cry34Aa4, and as such we may expect to find this protein, or one like it, as well.

Visualizing the result

The easiest way to visualize the results from the three separate components of the tool is to use a Venn diagram.

Figure 1: A Venn diagram showing the overlap of the output of the various methods used to predict whether or not certain genes from the genome are cry proteins or not.
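
The overlap that the Venn diagram displays can also be computed directly from the three output lists. The following is a minimal sketch of that idea, not the web tool's implementation: the function name `venn_regions` and the gene IDs in the example are invented for illustration.

```python
from itertools import combinations


def venn_regions(predictions):
    """Return the genes in each exclusive region of a Venn diagram.

    predictions maps a method name to the set of gene IDs it flagged.
    Each key of the result is a frozenset of method names; its value is the
    set of genes flagged by exactly those methods and by no other method.
    Empty regions are left out.
    """
    methods = list(predictions)
    regions = {}
    for size in range(1, len(methods) + 1):
        for combo in combinations(methods, size):
            # genes flagged by every method in this combination
            inside = set.intersection(*(predictions[m] for m in combo))
            # genes flagged by any method outside this combination
            others = [predictions[m] for m in methods if m not in combo]
            outside = set().union(*others)
            genes = inside - outside
            if genes:
                regions[frozenset(combo)] = genes
    return regions


if __name__ == "__main__":
    # Hypothetical gene IDs; these are not results from the pipeline.
    example = {
        "blastp": {"gene_a", "gene_b", "gene_c"},
        "hmmscan": {"gene_a", "gene_b", "gene_d"},
        "RandomForest": {"gene_c", "gene_d", "gene_e"},
    }
    for combo, genes in venn_regions(example).items():
        print(sorted(combo), sorted(genes))
```

Listing the regions this way makes it easy to pull out, for instance, the genes flagged by both blastp and hmmscan but not by RandomForest.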

As BLAST had only 5 results it is easier to examine this method in detail. Two of the five were indeed proteins we expected from the paper: Cry35Aa4 and Cry21Aa2. Both of these were also picked up by the Hidden Markov Model method of cry protein detection as shown in Figure 1, but not by the RandomForest method.
Next we found a protein that matched very well with the Cry34Aa group, which is complementary to Cry35Aa4.
The two other hits were a match to Cry38Aa1, which has no known insecticidal target but is highly similar to proteins that do (Cry15Aa1, Cry23Aa1, and Cry33Aa1), 17
and "Gene_5518", which was 85.87% identical to Cry14Ab1. Cry14Ab1 is only mentioned in a patent by Sampson et al. (2012). 18 This hit is quite interesting because it might be a completely new protein or a nematode-specific variant of Cry14Ab1.
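
A short list of BLAST hits like the one discussed above could be produced by a filter along the following lines. This is a sketch under stated assumptions: it expects BLAST's 12-column tabular output (`-outfmt 6`), which may not be the format the pipeline uses, and the function name and both thresholds are hypothetical. Apart from the 85.87% identity of Gene_5518 to Cry14Ab1, every value in the example rows is made up.

```python
import csv
import io

# Column order of BLAST's default tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, min_identity=80.0, max_evalue=1e-10):
    """Keep, per query gene, the highest-bitscore hit that passes both cutoffs.

    The cutoff values are illustrative assumptions, not the pipeline's settings.
    Returns {query_id: (subject_id, percent_identity, bitscore)}.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        identity = float(hit["pident"])
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if identity < min_identity or evalue > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[2]:
            best[hit["qseqid"]] = (hit["sseqid"], identity, bitscore)
    return best


if __name__ == "__main__":
    # Invented example rows; only the 85.87% identity comes from the text above.
    table = io.StringIO(
        "Gene_5518\tCry14Ab1\t85.87\t1100\t150\t5\t1\t1100\t1\t1095\t0.0\t1900\n"
        "Gene_5518\tCry14Aa1\t70.10\t1000\t290\t9\t1\t1000\t1\t990\t0.0\t1400\n"
        "Gene_0001\tCry1Aa1\t30.00\t200\t140\t3\t1\t200\t1\t198\t0.002\t40\n"
    )
    for gene, (subject, identity, _) in best_hits(table).items():
        print(gene, subject, identity)
```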

Tool overview

We use a combination of existing tools to come to the prediction of novel cry proteins. The entire pipeline consists of four scripts in total, one of which is entirely dedicated to analysis of the Random Forest model and not further used in the main program. The others are there to group the Machine Learning specific functions, the functions that handle known cry proteins, and the functions that handle raw sequence data. Figure 2 gives a graphical representation of the pipeline, though some modules have been left out for the sake of readability. Here, we shall go through each part of the process in a step-by-step manner.

Tools shown in the pipeline diagram: idba_ud, genemark, known cry proteins, hmmscan, blastp, RandomForest.
Figure 2: A graphical representation of the pipeline showing the various methods and tools used. All the pink diamonds are clickable and will take you to the respective tool's homepage. The known cry proteins box will take you to the cry protein database.

Software description

Click here for a highly detailed overview of how the pipeline works.

References

1. Ye, W., Zhu, L., Liu, Y., Crickmore, N., Peng, D., Ruan, L., & Sun, M. (2012). Mining new crystal protein genes from Bacillus thuringiensis on the basis of mixed plasmid-enriched genome sequencing and a computational pipeline. Applied and environmental microbiology, 78(14), 4795-4801.

2. Alquisira-Ramírez, E. V., Paredes-Gonzalez, J. R., Hernández-Velázquez, V. M., Ramírez-Trujillo, J. A., & Peña-Chora, G. (2014). In vitro susceptibility of Varroa destructor and Apis mellifera to native strains of Bacillus thuringiensis. Apidologie, 45(6), 707-718.

3. Compeau, P. E., Pevzner, P. A., & Tesler, G. (2011). How to apply de Bruijn graphs to genome assembly. Nature biotechnology, 29(11), 987-991.

4. Rabiner, L., & Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1), 4-16.

5. Crickmore, N., Baum, J., Bravo, A., Lereclus, D., Narva, K., Sampson, K., Schnepf, E., Sun, M., & Zeigler, D. R. (2016). Bacillus thuringiensis toxin nomenclature. http://www.btnomenclature.info/

6. Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., & de Hoon, M. J. L. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423.

7. Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.

8. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.

9. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755-763.

10. Sievers, F., & Higgins, D. G. (2014). Clustal Omega, accurate alignment of very large numbers of sequences. Multiple sequence alignment methods, 105-116.

11. Guruprasad, K., Reddy, B. B., & Pandit, M. W. (1990). Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein engineering, 4(2), 155-161.

12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.

13. Masri, L., Branca, A., Sheppard, A. E., Papkou, A., Laehnemann, D., Guenther, P. S., ... & Brzuszkiewicz, E. (2015). Host–pathogen coevolution: the selective advantage of Bacillus thuringiensis virulence and its cry toxin genes. PLoS Biol, 13(6), e1002169.

14. de Maagd, R. A., Bravo, A., & Crickmore, N. (2001). How Bacillus thuringiensis has evolved specific toxins to colonize the insect world. Trends in Genetics, 17(4), 193-199.

15. Peng, Y., Leung, H. C., Yiu, S. M., & Chin, F. Y. (2012). IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28(11), 1420-1428.

16. Besemer, J., Lomsadze, A., & Borodovsky, M. (2001). GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic acids research, 29(12), 2607-2618.

17. Baum, J. A., Chu, C. R., Rupar, M., Brown, G. R., Donovan, W. P., Huesing, J. E., ... & Vaughn, T. (2004). Binary toxins from Bacillus thuringiensis active against the western corn rootworm, Diabrotica virgifera virgifera LeConte. Applied and environmental microbiology, 70(8), 4889-4898.

18. Sampson, K. S., Tomso, D. J., & Dumitru, R. V. (2012). U.S. Patent No. 8,318,900. Washington, DC: U.S. Patent and Trademark Office.

19. Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7), 1145-1159.