Difference between revisions of "Team:Wageningen UR/Software"

Line 150: Line 150:
 
<p onclick="javascript:ShowHide('HiddenDiv1')" style="border: 2px solid gray;">Click here to open or close the overview.</p>
 
<p onclick="javascript:ShowHide('HiddenDiv1')" style="border: 2px solid gray;">Click here to open or close the overview.</p>
 
</section>
 
</section>
 
 
<section id="TS_results">
 
<section id="TS_results">
 
<h1><i>Varroa</i> Isolate results</h1>
 
<h1><i>Varroa</i> Isolate results</h1>
<p> Given the Next Generation Sequencing (NGS) results from the <a href="https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#Isolates4"><i>Varroa</i> isolates experiment</a> which appeared to be a close match to <i>Lysinibacillus</i> according to the 16S analysis results as seen in [lisa notebook#october]. The pipeline was able to piece together the genome into 822 contigs, from more than 17.5 million individual reads. More statistics about the assembly are found below:
+
<p> The <a href="https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#Isolates4"><i>Varroa</i> isolates experiment</a> found an isolate of interest which warranted more analysis. This potential <i>Varroa</i>-killing bacterium was sequenced using Next Generation Sequencing (NGS). According to the 16S analysis results as seen in <a href="https://2016.igem.org/Team:Wageningen_UR/Notebook/VarroaIsolates">Lisa's notebook</a>, this isolate appeared to be a close match to <i>Lysinibacillus</i>.
</p>
+
From over 17.5 million reads, the pipeline assembled the sequencing results into a genome consisting of 822 contigs. From this assembled genome the pipeline was able to find 4,427 genes, of which 4 were identified as potential Cry proteins. These were further investigated. For the results, click on the button below. More statistics about the assembly can be found in Table 1.</p>
<ul>
+
<li>Total reads: 17,579,690 </li>
+
<li>Number of aligned reads: 17,078,201 </li>
+
<li>Expected genome coverage: 4.35088 </li>
+
<li>De Bruijn Graph edges: 112 </li>
+
<li>Total contigs: 822 </li>
+
<li><a class="tooltip">n50 value<span class="tooltiptext" style="width:500px;">
+
The N50 value is a weighted median statistic such that 50% of the entire assembly is contained in contigs of larger size than this number. </span></a>: 163,362</li>
+
<li>Largest contig length: 504,583</li>
+
<li>Mean contig length: 4,905</li>
+
<li>Total genome length: 4,032,210 </li>
+
</ul>
+
  
<p>
+
<figure><figcaption>Table 1. Genome Assembly statistics from the <a href="https://2016.igem.org/Team:Wageningen_UR/Description/Specificity#Isolates4"><i>Varroa</i> isolates experiment</a> </figcaption></figure>
From this assembled genome the pipeline was able to find 4,4427 genes. From this entire list only 4 were found by the pipeline to be potential Cry proteins. These were investigated and the results of which can be viewed by clicking on the button below.
+
<table>
</p>
+
<tbody>
<br/>
+
<tr>
<a href="https://static.igem.org/mediawiki/2016/9/97/T--Wageningen_UR--pipeline_NGS_result_inspection.pdf">
+
<td >Total reads</td>
<figure><img src="https://static.igem.org/mediawiki/2016/3/3a/T--Wageningen_UR--toplogobutton.jpg"/></a><figcaption>Click the button to go to the screenshot! </figcaption></figure>
+
<td>17,579,690</td>
 +
</tr>
 +
<tr>
 +
<td>Number of aligned reads</td>
 +
<td>17,078,201</td>
 +
</tr>
 +
<tr>
 +
<td>Expected genome coverage</td>
 +
<td>4.35088</td>
 +
</tr>
 +
<tr>
 +
<td>De Bruijn Graph edges</td>
 +
<td>112</td>
 +
</tr>
 +
<tr>
 +
<td>Total contigs</td>
 +
<td>822</td>
 +
</tr>
 +
<tr>
 +
<td>n50 value</td>
 +
<td>163,362</td>
 +
</tr>
 +
<tr>
 +
<td>Largest contig length</td>
 +
<td>504,583</td>
 +
</tr>
 +
<tr>
 +
<td>Mean contig length</td>
 +
<td>4,905</td>
 +
</tr>
 +
<tr>
 +
<td>Total genome length</td>
 +
<td>4,032,210</td>
 +
</tr>
 +
</tbody>
 +
</table>
  
<p>The button links to a pdf which shows 3 screenshots of further examination of "gene_678", which was found to be a potential Cry protein.  
+
 
 +
<figure><a href="https://static.igem.org/mediawiki/2016/9/97/T--Wageningen_UR--pipeline_NGS_result_inspection.pdf"><img src="https://static.igem.org/mediawiki/2016/3/3a/T--Wageningen_UR--toplogobutton.jpg"></a><figcaption>Click the button to go to the screenshot images! </figcaption></figure>
 +
 
 +
<p>The button links to a pdf which shows 4 screenshots of further examination of "gene_678", which was found to be a potential Cry protein.  
 
<br/>
 
<br/>
<b>The first image</b> shows the blast result from this gene against the non redundant protein database. The conserved domains shown are the OrfB_IS605 superfamily and the Cysteine rich_CPCC domain. Neither of these are known to have a functional characterization. The main hits in this image are of "transposases", which are a class of genes known to move, and bind to, <a class="tooltip">Transposons<span class="tooltiptext" style="width:500px;"> A transposon is a DNA sequence that can change its position within a genome.</span></a> But only the first 77% of the gene hits to anything. So we decided to investigate the remaining 23% of the gene.
+
<b>Image 1)</b> shows the BLAST result from this gene against the <a class="tooltip"><i>nr</i> protein database<span class="tooltiptext" style="width:500px;">The Non-Redundant (<i>nr</i>) database contains non-redundant sequences from GenBank translations (i.e. GenPept) together with sequences from other databanks (RefSeq, PDB, SwissProt, PIR and PRF).</span></a>. The conserved domains shown are the OrfB_IS605 superfamily and the Cysteine rich_CPCC domain. Neither of these has a known functional characterization. The main hits belong to a class of genes known as transposases, which are known to move, and bind to, <a class="tooltip">transposons<span class="tooltiptext" style="width:500px;"> A transposon is a DNA sequence that can change its position within a genome.</span></a>. However, these hits cover only the first 77% of the gene sequence. The remaining 23% requires further investigation.
 
<br/>
 
<br/>
<b>The second image</b> shows the blast result from the unknown 23% of the gene against the non redundant protein database. In this image this piece is just named "Protein Sequence (71 letters)" and seems to be quite comparable to known membrane proteins. Cry proteins are known to interact with the membrane to form the pores needed to kill their target.
+
<b>Image 2)</b> shows the BLAST result from the remaining 23% of the gene sequence against the <i>nr</i> protein database (in this image named “Protein Sequence (71 letters)”). The sequence shows similarity to known membrane proteins. Cry proteins are known to interact with the membrane to form the pores needed to kill their target.
 
<br/>
 
<br/>
<b>The third image</b> shows the output of the "Coils" expasy tool <sup><a href="#ts20" id="ref_ts20">4</a></sup>, used to examine the secondary structure of this protein. This image shows that there may be some coils at the start of the protein, and some around the 200 amino acid area.  
+
<b>Image 3) </b> shows the output of the "Coils" ExPASy tool <sup><a href="#ts20" id="ref_ts20">20</a></sup>, used to examine the secondary structure of the Cry2Ab8 protein. This image shows that there are probably some coils around amino acid position 100.
 
</p>
 
</p>
 +
<br/>
 +
<b>Image 4)</b> shows the output of the "Coils" ExPASy tool <sup><a href="#ts20" id="ref_ts20">20</a></sup>, used to examine the secondary structure of the protein encoded by "gene_678". This image shows that there may be some coils at the start of the protein.
 +
</p>
 +
<p>The other 3 proteins were also examined; they showed similar BLAST hits and were highly similar to "gene_678". Designating these pipeline hits as true Cry proteins will require experimental validation.
 
</section>
 
</section>
 
 
<section id="references">
 
<section id="references">
 
<h1>References</h1>
 
<h1>References</h1>

Revision as of 19:49, 19 October 2016

Wageningen UR iGEM 2016

 

Toxin Scanner

BioBrick discovery

For iGEM 2016 we designed a high-throughput pipeline for the identification of novel proteins directly from raw genome sequencing data. Given the specificity of our tool and the importance of BioBrick discovery in iGEM, we made it publicly available for everyone to modify and use. The tool can be found in this GitLab repository.
For the purpose of the BeeT project, we use it as a Cry toxin predictor, given genomes of selected bacteria.

Toxin Specificity

For this project we need a toxin that specifically targets Varroa destructor. The best-known miticidal proteins are the crystal (Cry) proteins. These are usually found on megaplasmids from Bacillus thuringiensis and related species. 14

We know Varroa-specific miticidal activity exists in Bacillus thuringiensis and related species, as shown in "In vitro susceptibility of Varroa destructor and Apis mellifera to native strains of Bacillus thuringiensis" by Alquisira-Ramírez et al. 2 In this paper, several isolates are described that cause a mite mortality of up to 100%. Importantly, the strains showed no toxic activity against bee larvae. Because of this we started several sub-projects in parallel to maximize our chances of finding a viable V. destructor-killer. The specificity part of our project focuses on creating V. destructor-gut binding Cry toxins and finding V. destructor-specific miticidal proteins.

There already exists a publication about a tool called “Bt Toxin Scanner”1. This tool does not fully support local deployment, which is needed for high-throughput analysis. Because the analysis done by that tool is also relatively basic, we decided to develop our own tool that is fully open-source and improves upon the analysis techniques used in Bt Toxin Scanner. Our goal with this tool is to take raw sequencing files as input and deliver potential Cry proteins with just the click of a button.

This tool was made in preparation for results of the latter, finding V. destructor-specific miticidal proteins, which we assume to be of the Cry protein family. These Cry proteins are a diverse group, but are known to be highly specific for individual insects, acari, nematodes and various other eukaryotic taxa. Cry proteins are not necessarily a group of proteins that all perform the same function in the same manner. The distinction between Cry and non-Cry proteins is defined by a committee (see the Cry protein website). Based on 45% sequence similarity there are over 70 groups. This high diversity makes it hard to predict whether something is a Cry protein. Despite this diversity, many of them share the same three-domain structure. The N-terminal domain I is involved in membrane insertion and pore formation, while domains II and III are involved in receptor recognition and binding.

Testing the tool

We tested the tool on a genome sample from a study with accession number PRJEB5931. 13 This genome was obtained after a co-evolution experiment and belongs to a Bacillus thuringiensis with known nematicidal Cry proteins.
ERX463573 is the accession code of the experiment from which these raw read files came. That experiment studied a population of Bacillus thuringiensis which was coevolved with Caenorhabditis, a kind of nematode, as host. From the study we know to expect at the very least the following two proteins: Cry35Aa4 and Cry21Aa2.
According to the Cry protein toxin list, Cry35Aa4 is a binary toxin with Cry34Aa4, so we may expect to find this protein, or one like it, as well.

Visualizing the result

The easiest way to visualize the results from the three separate components of the tool is to use a Venn diagram.

Figure 1: A Venn diagram showing the overlap of the output of the various methods used to predict whether genes from the genome are Cry proteins.
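The regions of a three-way Venn diagram like Figure 1 can be derived with plain set operations. The following is a minimal sketch, not the pipeline's own code. The gene identifiers `gene_A`, `gene_B` and `gene_C`, the contents of the Random Forest set and the function name `venn_regions` are hypothetical placeholders. Only the fact that Cry35Aa4 and Cry21Aa2 are reported by BLAST and the Hidden Markov Model but not by the Random Forest is taken from the text.

```python
# Predicted Cry protein candidates per method.
# Only the BLAST names and the BLAST/HMM overlap of Cry35Aa4 and Cry21Aa2
# come from the text; every other membership is a hypothetical placeholder.
blast_hits = {"Cry35Aa4", "Cry21Aa2", "Cry34Aa-like", "Cry38Aa1", "Gene_5518"}
hmm_hits = {"Cry35Aa4", "Cry21Aa2", "gene_A", "gene_B"}
rf_hits = {"gene_B", "gene_C"}


def venn_regions(blast, hmm, rf):
    """Return the seven regions of a three-set Venn diagram as sets."""
    return {
        "blast_only": blast - hmm - rf,
        "hmm_only": hmm - blast - rf,
        "rf_only": rf - blast - hmm,
        "blast_hmm": (blast & hmm) - rf,
        "blast_rf": (blast & rf) - hmm,
        "hmm_rf": (hmm & rf) - blast,
        "all_three": blast & hmm & rf,
    }


regions = venn_regions(blast_hits, hmm_hits, rf_hits)
for name, genes in regions.items():
    print(f"{name}: {sorted(genes)}")
```

Because the seven regions are disjoint, their sizes add up to the size of the union of the three sets, which is a quick sanity check on the numbers drawn in the diagram.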

As BLAST had only 5 results, it is easier to examine this method in detail. Two of the five were indeed proteins we expected from the paper: Cry35Aa4 and Cry21Aa2. Both of these were also picked up by the Hidden Markov Model method of Cry protein detection, as shown in Figure 1, but not by the RandomForest method.
Next we found a protein that matched very well with the Cry34Aa group, which is the complementary protein to Cry35Aa4.
The two other proteins found were Cry38Aa1, which has no known insecticidal target but is highly similar to proteins that do (Cry15Aa1, Cry23Aa1, and Cry33Aa1) 17, and "Gene_5518", which was 85.87% identical to Cry14Ab1. Cry14Ab1 is only mentioned in a patent by Sampson et al. 2012. 18 "Gene_5518" is quite interesting because it might be a completely new protein or a variation of Cry14Ab1 specific to nematodes.
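Percent-identity values such as the 85.87% reported for "Gene_5518" can be read from BLAST's tabular output. The sketch below assumes the results were written with BLAST+ `-outfmt 6`, whose twelve default columns are listed in `BLAST_COLUMNS`. Whether the pipeline uses this output format is an assumption. The function name `best_hits`, the 45% cut-off and every number in the sample rows other than the 85.87% identity are hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, min_identity=45.0):
    """Keep, per query, the highest-bitscore hit at or above min_identity."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        identity = float(row["pident"])
        bitscore = float(row["bitscore"])
        if identity < min_identity:
            continue
        current = best.get(row["qseqid"])
        if current is None or bitscore > current[2]:
            best[row["qseqid"]] = (row["sseqid"], identity, bitscore)
    return best


# Hypothetical sample rows; only the 85.87% identity is taken from the text.
sample = io.StringIO(
    "Gene_5518\tCry14Ab1\t85.87\t1160\t164\t0\t1\t1160\t1\t1160\t0.0\t2000\n"
    "Gene_0001\tCry1Aa1\t30.10\t200\t140\t2\t5\t204\t10\t209\t1e-05\t50\n"
)
hits = best_hits(sample)
print(hits)
```

Filtering on identity before comparing bit scores keeps weak matches from appearing as candidates, at the cost of missing very divergent Cry proteins, which is why the pipeline combines BLAST with the other two methods.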

Tool overview

We use a combination of existing tools to come to the prediction of novel Cry proteins. The entire pipeline consists of four scripts in total, one of which is entirely dedicated to analysis of the Random Forest model and not further used in the main program. The others group the Machine Learning specific functions, the functions that handle known Cry proteins, and the functions that handle raw sequence data. Figure 2 gives a graphical representation of the pipeline, though some modules have been left out for the sake of readability. Here, we go through each part of the process in a step-by-step manner.

[Figure 2 image: pipeline diagram with the nodes idba_ud, genemark, Cry protein, hmmscan, blastp and RandomForest]
Figure 2: A graphical representation of the pipeline showing the various methods and tools used. All the pink diamonds are clickable and will take you to the respective tool's homepage. The known Cry proteins box will take you to the Cry protein database.

Software description

Click here for a highly detailed overview of how the pipeline works.


Varroa Isolate results

The Varroa isolates experiment found an isolate of interest which warranted more analysis. This potential Varroa-killing bacterium was sequenced using Next Generation Sequencing (NGS). According to the 16S analysis results as seen in Lisa's notebook, this isolate appeared to be a close match to Lysinibacillus. From over 17.5 million reads, the pipeline assembled the sequencing results into a genome consisting of 822 contigs. From this assembled genome the pipeline was able to find 4,427 genes, of which 4 were identified as potential Cry proteins. These were further investigated. For the results, click on the button below. More statistics about the assembly can be found in Table 1.

Table 1. Genome Assembly statistics from the Varroa isolates experiment
Total reads 17,579,690
Number of aligned reads 17,078,201
Expected genome coverage 4.35088
De Bruijn Graph edges 112
Total contigs 822
N50 value 163,362
Largest contig length 504,583
Mean contig length 4,905
Total genome length 4,032,210
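The N50 value is a weighted median: 50% of the assembly is contained in contigs at least this long. The sketch below shows one way to compute it, together with two checks on the values in Table 1. It is not the pipeline's own code. The function name `n50` and the `toy_contigs` list are hypothetical, since the individual contig lengths are not given in the text.

```python
def n50(lengths):
    """Return the contig length L such that contigs of length >= L
    together contain at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical toy assembly; the real contig lengths are not listed.
toy_contigs = [8, 8, 4, 3, 3, 2, 2, 2]
print(n50(toy_contigs))  # 8

# Values taken from Table 1.
total_reads = 17_579_690
aligned_reads = 17_078_201
total_contigs = 822
total_genome_length = 4_032_210

# Mean contig length, rounded down, matches the 4,905 reported in Table 1.
mean_contig_length = total_genome_length // total_contigs
print(mean_contig_length)  # 4905

# Fraction of reads that aligned back to the assembly.
aligned_fraction = aligned_reads / total_reads
print(f"{aligned_fraction:.1%}")
```

The large gap between the mean contig length (4,905) and the N50 (163,362) indicates that most of the genome sits in a few long contigs, with many short contigs making up the remainder.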
Click the button to go to the screenshot images!

The button links to a pdf which shows 4 screenshots of further examination of "gene_678", which was found to be a potential Cry protein.
Image 1) shows the BLAST result from this gene against the nr protein database. (The Non-Redundant (nr) database contains non-redundant sequences from GenBank translations (i.e. GenPept) together with sequences from other databanks (RefSeq, PDB, SwissProt, PIR and PRF).) The conserved domains shown are the OrfB_IS605 superfamily and the Cysteine rich_CPCC domain. Neither of these has a known functional characterization. The main hits belong to a class of genes known as transposases, which are known to move, and bind to, transposons. (A transposon is a DNA sequence that can change its position within a genome.) However, these hits cover only the first 77% of the gene sequence. The remaining 23% requires further investigation.
Image 2) shows the BLAST result from the remaining 23% of the gene sequence against the nr protein database (in this image named “Protein Sequence (71 letters)”). The sequence shows similarity to known membrane proteins. Cry proteins are known to interact with the membrane to form the pores needed to kill their target.
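Splitting off the part of a protein that the first BLAST search did not cover can be sketched as follows. This is an illustration only, not the procedure the team used. The function name `unmatched_tail` and the placeholder sequence are hypothetical, as the real "gene_678" sequence is not given here. The length of 309 residues is an inference from the text: if the 71-letter fragment is 23% of the protein, the full length is roughly 71 / 0.23 ≈ 309.

```python
def unmatched_tail(protein, matched_fraction):
    """Return the C-terminal part of `protein` that lies beyond the
    first `matched_fraction` of its length."""
    if not 0.0 <= matched_fraction <= 1.0:
        raise ValueError("matched_fraction must be between 0 and 1")
    cut = round(len(protein) * matched_fraction)
    return protein[cut:]


# Placeholder sequence of the inferred length (309 residues);
# this is NOT the real gene_678 protein sequence.
placeholder_protein = "M" + "A" * 308

tail = unmatched_tail(placeholder_protein, 0.77)
print(len(tail))  # 71
```

The returned fragment can then be submitted as a separate BLAST query, as was done for the "Protein Sequence (71 letters)" shown in Image 2.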
Image 3) shows the output of the "Coils" ExPASy tool 20, used to examine the secondary structure of the Cry2Ab8 protein. This image shows that there are probably some coils around amino acid position 100.


Image 4) shows the output of the "Coils" ExPASy tool 20, used to examine the secondary structure of the protein encoded by "gene_678". This image shows that there may be some coils at the start of the protein.

The other 3 proteins were also examined; they showed similar BLAST hits and were highly similar to "gene_678". Designating these pipeline hits as true Cry proteins will require experimental validation.

References

    1. Ye, W., Zhu, L., Liu, Y., Crickmore, N., Peng, D., Ruan, L., & Sun, M. (2012). Mining new crystal protein genes from Bacillus thuringiensis on the basis of mixed plasmid-enriched genome sequencing and a computational pipeline. Applied and environmental microbiology, 78(14), 4795-4801.

    2. Alquisira-Ramírez, E. V., Paredes-Gonzalez, J. R., Hernández-Velázquez, V. M., Ramírez-Trujillo, J. A., & Peña-Chora, G. (2014). In vitro susceptibility of Varroa destructor and Apis mellifera to native strains of Bacillus thuringiensis. Apidologie, 45(6), 707-718.

    3. Compeau, P. E., Pevzner, P. A., & Tesler, G. (2011). How to apply de Bruijn graphs to genome assembly. Nature biotechnology, 29(11), 987-991.

    4. Rabiner, L., & Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1), 4-16.

    5. Crickmore, N., Baum, J., Bravo, A., Lereclus, D., Narva, K., Sampson, K., Schnepf, E., Sun, M. and Zeigler, D.R. " Bacillus thuringiensis toxin nomenclature" (2016) http://www.btnomenclature.info/

    6. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423.

    7. Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22.

    8. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.

    9. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755-763.

    10. Sievers, F., & Higgins, D. G. (2014). Clustal Omega, accurate alignment of very large numbers of sequences. Multiple sequence alignment methods, 105-116.

    11. Guruprasad, K., Reddy, B. B., & Pandit, M. W. (1990). Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein engineering, 4(2), 155-161.

    12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.

    13. Masri, L., Branca, A., Sheppard, A. E., Papkou, A., Laehnemann, D., Guenther, P. S., ... & Brzuszkiewicz, E. (2015). Host–pathogen coevolution: the selective advantage of Bacillus thuringiensis virulence and its cry toxin genes. PLoS Biol, 13(6), e1002169.

    14. de Maagd, R. A., Bravo, A., & Crickmore, N. (2001). How Bacillus thuringiensis has evolved specific toxins to colonize the insect world. TRENDS in Genetics, 17(4), 193-199.

    15. Peng, Y., Leung, H. C., Yiu, S. M., & Chin, F. Y. (2012). IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28(11), 1420-1428.

    16. Besemer, J., Lomsadze, A., & Borodovsky, M. (2001). GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic acids research, 29(12), 2607-2618.

    17. Baum, J. A., Chu, C. R., Rupar, M., Brown, G. R., Donovan, W. P., Huesing, J. E., ... & Vaughn, T. (2004). Binary toxins from Bacillus thuringiensis active against the western corn rootworm, Diabrotica virgifera virgifera LeConte. Applied and environmental microbiology, 70(8), 4889-4898.

    18. Sampson, K. S., Tomso, D. J., & Dumitru, R. V. (2012). U.S. Patent No. 8,318,900. Washington, DC: U.S. Patent and Trademark Office.

    19. Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7), 1145-1159.

    20. Lupas, A., Van Dyke, M., & Stock, J. (1991). Predicting coiled coils from protein sequences. Science, 252, 1162-1164.