Team:Wageningen UR/Notebook/ToxinScanner

Wageningen UR iGEM 2016

 

These experiments were performed by Ronald de Jongh.

March

Week 1 (14th)

Started literature research on Cry proteins, Varroa destructor, and Machine Learning.
Compared acari species to find Bt toxins that might serve as a backup for killing Varroa. Found a mite that is related to V. destructor and has a Bt toxin targeting it.

Week 2 (21st)

Viewed a WEKA tutorial, checked out the Bt Toxin Scanner, and looked at RapidMiner and other tools that might be of use.
Found the Varroa genome: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-11-602
The GenBank entry of the Varroa genome is completely unannotated. Will now look at it with GeneMark to get an .faa file, and maybe BLAST all of it and take the best match per gene.
Also found the Cytostudio tool, which might be interesting.

April

Week 3 (4th)

Continued literature research on Cry proteins, Varroa destructor, the Cytostudio tool and Machine Learning. Found Cello, a DNA programming language, and received some information about RDF / Blazegraph.

Week 4 (11th)

Tried out Cello, but it keeps crashing; two attempts at putting in Marijn's kill switch system made me give up. Continued learning how RDF works.
Downloaded Weka and BioWeka.
Currently running AUGUSTUS on the Varroa genome (using the mosquito "aedes" model) to predict genes, looking for something we could target.
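The AUGUSTUS run itself is essentially a single command; a minimal sketch of how it could be wrapped from Python (file names are placeholders, and the exact species model name should be checked against the local AUGUSTUS installation):

<code>
import subprocess

# Placeholders: the real genome file and output name differ.
genome_file = "varroa_destructor_genome.fasta"
output_gff = "varroa_augustus.gff"

# AUGUSTUS prints its GFF predictions to stdout, so redirect it to a file.
with open(output_gff, "w") as handle:
    subprocess.check_call(["augustus", "--species=aedes", genome_file], stdout=handle)
</code>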

Week 5 (18th)

The Varroa genome has 21,190 gene IDs; created a FASTA file of the predicted protein sequences. Examined RDF, (Bio)Weka, the Varroa genome and the Bt Toxin Scanner some more.
Went to BioSB!
Found a nice Cry toxin list with over 800 sequences; for the Bt Toxin Scanner they only seemed to use 606.
When searching for "endotoxin OR crystal AND Bacillus thuringiensis[Organism]" in NCBI Protein you get 2,242 hits, so I can't really take my data from there; it's too much to curate manually. Instead I used getCry.sh to get as many proteins as possible.
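getCry.sh does the actual downloading; purely as an illustration, a comparable fetch could be scripted with Biopython's Entrez module along these lines (query string, e-mail address and output file name are placeholders, and large downloads should really be batched):

<code>
from Bio import Entrez

Entrez.email = "someone@example.org"  # NCBI requires a contact address; placeholder

query = '(endotoxin OR crystal) AND "Bacillus thuringiensis"[Organism]'
search = Entrez.read(Entrez.esearch(db="protein", term=query, retmax=5000))

# Fetch every hit as FASTA; for thousands of records this should be done in batches.
handle = Entrez.efetch(db="protein", id=search["IdList"], rettype="fasta", retmode="text")
with open("bt_protein_candidates.faa", "w") as out:
    out.write(handle.read())
</code>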


Week 6 (25th)

Checked out the Galaxy instance from SSB, which contains Prodigal, assembly tools of all kinds, and gene annotation tools; now to check out the data!
Worked on GeneMark; it produced files with two '>' FASTA markers in the header. Regular expressions to the rescue (see the sketch below)!
Pretty much figured out the SAPP pipeline and got GeneMark to work; making a Clustal Omega alignment of 6 Cry proteins as a test for the pipeline.
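The header fix itself is tiny; a sketch of it, assuming the problem is simply a doubled '>' at the start of the header lines (the real GeneMark output may be messier):

<code>
import re

with open("genemark_raw.faa") as source, open("genemark_fixed.faa", "w") as target:
    for line in source:
        # Collapse any run of '>' characters at the start of a header into one.
        target.write(re.sub(r"^>+", ">", line))
</code>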

May

Week 7 (2nd)

Started to build the actual pipeline script in Python! Got all the Cry files into the data directory using my script, though some didn't get downloaded or weren't available.
Implementing IDBA_UD, which does not work on reads that are too long.
Building ARFF files is something I should be able to do -> https://weka.wikispaces.com/Primer
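As a reminder of the format, an ARFF file is just a header describing the attributes followed by CSV-style data rows; a minimal sketch of writing one (the feature names and values here are made up):

<code>
# Hypothetical feature rows: (name, length, molecular weight, class label)
rows = [("Cry1Aa1", 1176, 133301.0, "cry"), ("someOther", 412, 45210.5, "not_cry")]

with open("cry_features.arff", "w") as arff:
    arff.write("@RELATION cry_proteins\n\n")
    arff.write("@ATTRIBUTE name STRING\n")
    arff.write("@ATTRIBUTE length NUMERIC\n")
    arff.write("@ATTRIBUTE mol_weight NUMERIC\n")
    arff.write("@ATTRIBUTE class {cry,not_cry}\n\n")
    arff.write("@DATA\n")
    for name, length, weight, label in rows:
        arff.write("%s,%d,%.1f,%s\n" % (name, length, weight, label))
</code>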
So it turns out Cry1Ib5 and Cry1Fa3 are duplicates, and there may be more in there! I had aligned them all with group_cry.sh and put one of those alignments into a web-browser-based hmmsearch, which complained about a duplicate entry.
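A quick way to flag any remaining duplicates is to index the records by their sequence, e.g. (a sketch using Biopython; the file name is a placeholder):

<code>
from collections import defaultdict
from Bio import SeqIO

records_by_sequence = defaultdict(list)
for record in SeqIO.parse("all_cry_proteins.faa", "fasta"):
    records_by_sequence[str(record.seq)].append(record.id)

# Any sequence shared by more than one identifier is a duplicate entry.
for names in records_by_sequence.values():
    if len(names) > 1:
        print("duplicate sequence shared by: " + ", ".join(names))
</code>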

Week 8 (9th)

Renamed prot.py to known_protein_pipeline_v1.py; going to translate all cry[0-9]+\_group.fasta files to .faa files and move them. Renamed thesis_pipeline_v1.py to sequence_data_pipeline_v1.py on the server. Working with PyCharm is quite an improvement. Also made a flowchart of the data structure,
and experimented with Machine Learning in R after finishing up the pipelines up to the HMMs.
Found the peptides package for R, which can calculate all kinds of statistics about protein sequences; also made a table of the features used by FFpred.
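The R peptides package covers most of these directly; for the Python side of the pipeline, a comparable (smaller) feature set could be sketched with Biopython's ProtParam module, something like:

<code>
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def basic_protein_features(sequence):
    """Return a small dictionary of sequence-derived features for one protein."""
    analysed = ProteinAnalysis(sequence)
    features = {
        "length": len(sequence),
        "molecular_weight": analysed.molecular_weight(),
        "isoelectric_point": analysed.isoelectric_point(),
        "aromaticity": analysed.aromaticity(),
        "instability_index": analysed.instability_index(),
    }
    # One fraction per amino acid type, analogous to aaComp in the R peptides package.
    features.update(analysed.get_amino_acids_percent())
    return features
</code>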
Could use http://medbac.dk/slst/pacnes to make a new Bt phylogeny. -> from the research proposal.
Talked with people about project integration and tested the known-protein stuff; something seems to have broken. -> Bug fixing -> issues between Unix and DOS newlines. dos2unix to the rescue.

Week 9 (16th)

Got a bunch of errors out of the program; it should now be a lot stricter about letting sequences into the HMMs. The known_protein_pipeline seems to function now.
Thoughts on Pipeline structure:

  • ¿¿Should bash calls be their own function -> "bash_" + thing_you're_making() ?? (align_files does not do this)
  • Anything I might reuse should be its own function -> sensible_name()
  • Anything both pipelines use should probably be in a separate (super?) file -> e.g. make_dir, global dir vars, ¿all dir vars?
  • ¿An interface for global vars and things such as extensions?
  • All functions/scripts should have a docstring -> also usable as a usage message
  • Should all user input be caught with input flags? (run_get_cry does not do this right now)
  • Having user input set to True should always be the default, but I usually call my functions without it in the main pipeline.

known_protein_pipeline_v1.py is now complete and PEP 8 acceptable. Going to start working on the FFpred tools wrapper.
Going to try to do the Machine Learning / Pattern Recognition wrapper in R, using these programs and R peptides commands:

  • R (peptides.lengthpep)
  • R (peptides.mw)
  • R (peptides.kidera)
  • R (peptides.charge)
  • R (peptides.pI)
  • R (peptides.aacomp)
  • R (peptides.boman)
  • R (peptides.aindex)
  •  Psipred
  • ncoils
  • disopred
  • EMBOSS.epestfind
  • pfilt
  • netphos
  • net-o-glyc
  • net-n-glyc
  • SignalP


Downloaded DISOPRED, memsat-svm, pfilt and psipred, none of which worked as easily as I had hoped.

Week 10 (23d)

After experimenting with blastpgp I found it is not worth the computational time to add. Since it is needed for DISOPRED, I looked for a replacement and found IUPred. It has three options: long, short and glob. Long and short give a number between 0 and 1 for each amino acid, with 1 being high disorder. In my test run these two lists are correlated at 0.82, so I'll probably only use long and glob.

I found the E. coli C321 delta chassis: http://www.ncbi.nlm.nih.gov/bioproject/215084 with accession code PRJNA215084

Playing around with SVMs and data; it's fun! Protein statistics per positional amino acid are hard; maybe do sliding windows of a certain size and take average values?
Compared the secondary structure from PSIPRED with an actual PDB structure (found by PSI-BLASTing my own Cry protein database), and they fit quite well!

In the FFpred_wrapper script I've been working on, I had worked with pieces of 20 amino acids each, instead of being inspired by the real FFpred and creating a fixed number of equal-sized chunks per protein.
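The equal-sized-chunks idea boils down to averaging a per-residue track (disorder, hydrophobicity, ...) over a fixed number of chunks, so that every protein gives a feature vector of the same length regardless of its size; a sketch:

<code>
import numpy as np

def chunk_means(per_residue_scores, n_chunks=10):
    """Average a per-residue score track over n equal-sized chunks.
    Proteins shorter than n_chunks would need special handling."""
    scores = np.asarray(per_residue_scores, dtype=float)
    return [chunk.mean() for chunk in np.array_split(scores, n_chunks)]
</code>

Every protein then contributes n_chunks numbers per score track, whatever its length.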

 

Week 11 (30th)

Bugfixing FFpred, dealing with the format of the comma-separated (CSV) file I'm making. Cross-validated the SVMs:

<code>
from sklearn import svm
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older scikit-learn

param_grid = [
    {'C': [0.1, 1, 10, 100], 'kernel': ['linear']},
    {'C': [0.1, 1, 10, 100], 'gamma': ['auto', 0.001, 0.0001], 'kernel': ['rbf']},
    {'C': [0.1, 1, 10, 100], 'gamma': ['auto', 0.001, 0.0001], 'kernel': ['sigmoid']},
]
grid_svm = GridSearchCV(svm.SVC(), param_grid, n_jobs=9, cv=6)  # cv=6 uses stratified folds for a classifier
</code>

Random Forest gives very good predictive-value scores, which may indicate overfitting. After examining the SVMs and the Random Forest, I will use Random Forest from now on: 85% accuracy from the SVM vs 99% accuracy from the Random Forest.
The way I worked, I have some Cry proteins left that I haven't touched with any machine learning approach; these will be my true validation set, to see whether the highest-scoring methods indeed do as I want.
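For the record, the comparison was essentially of this shape (a scikit-learn sketch; the generated data stands in for the real feature table and labels):

<code>
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn import svm

X, y = make_classification(n_samples=500, n_features=40, random_state=0)  # stand-in data

for name, model in [("SVM (RBF)", svm.SVC(kernel="rbf", gamma="auto")),
                    ("Random Forest", RandomForestClassifier(n_estimators=100))]:
    scores = cross_val_score(model, X, y, cv=6)
    print("%s: mean accuracy %.3f" % (name, scores.mean()))
</code>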

June

Week 12 (6th)

Investigated Cry protein features some more; haven't found much useful information. Worked a lot on cleaning code and adding documentation. Also tested AdaBoost vs Random Forest, but there was no clear advantage.

Week 13 (13th)

Experimented with clustering before doing AdaBoost, but this time it did not help at all. Results against the BLAST-based truth:

Total: 1639

                       Blast-truth    Not found by blast
Predicted hit               10               109
Predicted not-hit           25              1495
Truth (column totals)       35              1604

The measure I like most is positive predictive value (PPV = TP / (TP + FP)); we need it to be high because we want as few false positives as possible, to reduce futile testing time on the toxicity assays.
PPV here was 10 / (10 + 109) = 10/119 = 0.084. Without clustering it was a little better, something like 20/95 = 0.211.

Week 14 (20th)

Switched projects from here on out; working (mostly) on the Metabolic Modelling project.

Week 15 (27th)

Still did some domain analysis with InterProScan.

  1. Of the domains now in the 500-protein set, about 14 are directly related to endotoxins
  2. 3 are related to lectins: lectins are a type of protein that can bind to cell membranes; they are sugar-binding and become the "glyco" portion of glycoconjugates on the membranes.
  3. 2 are proteinase inhibitors
  4. Some are associated with pseudo-something synthase <- the only odd ones out.

July

Week 16 (4th)

Metabolic Modelling project

Week 17 (11th)

Metabolic Modelling project

Week 18 (18th)

Picked up the Machine Learning again. After a lot of experimenting, the main idea now is to compare proteins that are defined as Cry with non-Cry proteins that still have a certain amount of sequence similarity. This was done by comparing all Cry proteins in my database against the current UniProt database, and then removing all Cry proteins and proteins that could be Cry proteins (hypothetical / badly annotated ones).

Week 19 (25th)

Made a plot of the ROC curves of 10-fold cross-validated Random Forests; it looks amazing.
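A sketch of how such a plot can be made (scikit-learn plus matplotlib; the generated data stands in for the real feature table):

<code>
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=600, n_features=40, random_state=1)  # stand-in data

for fold, (train, test) in enumerate(StratifiedKFold(n_splits=10).split(X, y)):
    model = RandomForestClassifier(n_estimators=100).fit(X[train], y[train])
    probabilities = model.predict_proba(X[test])[:, 1]
    fpr, tpr, _ = roc_curve(y[test], probabilities)
    plt.plot(fpr, tpr, alpha=0.6, label="fold %d (AUC %.2f)" % (fold + 1, auc(fpr, tpr)))

plt.plot([0, 1], [0, 1], "k--")  # the diagonal marks random guessing
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(fontsize="small")
plt.show()
</code>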

August

Week 20 (1st)

Worked on getting PRJEB5931 working with IDBA_UD and the rest of the sequence_data pipeline. Also worked on the pipeline somewhat and tried to find an alternative to ROC curves for AdaBoost, but that is generally not done; people just show the decision boundary.

Week 21 (8th)

 Metabolic Modelling project.

Week 22 (15th)

Metabolic Modelling project

Week 23 (22nd)

Added a picture of why the Random Forest / AdaBoost feature prediction is so good: the data is nearly linearly separable (TrEMBL non-Cry vs actual Cry). But I picked ~200 actual Cry proteins at random, and I feel like I should check all of them. I'd like to have a list of potentials as output.
Improving the pipeline and making sure the BLAST works with a nice cutoff; going to have to make sure that all the cutoffs are exposed to the initial tool run.
The end result is going to be 3 files with lots of info per tool and 3 files of just GeneMark codes, which I can put into an online Venn diagram tool from the University of Ghent.
The testing worked out quite well; putting the results on the main Software page.
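The three plain lists make the comparison trivial with set operations; a sketch (the file names are placeholders for the per-tool GeneMark-code lists):

<code>
def read_ids(path):
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}

hmm_hits = read_ids("hmmscan_genemark_codes.txt")
blast_hits = read_ids("blast_genemark_codes.txt")
rf_hits = read_ids("randomforest_genemark_codes.txt")

print("found by all three:", len(hmm_hits & blast_hits & rf_hits))
print("HMM only:", len(hmm_hits - blast_hits - rf_hits))
print("BLAST only:", len(blast_hits - hmm_hits - rf_hits))
print("RF only:", len(rf_hits - hmm_hits - blast_hits))
</code>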

Week 24 (29th)

Metabolic Modelling project / Lab experiments

September

Week 25 (5th)

Code Organization:

  • Known_protein: puts the information on the known Cry proteins in order, creating things like CSV files for machine learning, BLAST databases and hidden Markov models
  • FFpred_wrapper: reads in FASTA files and creates models out of them through Random Forest (a big chunk is called by Known_protein and the other bits by Sequence_data)
  • Sequence_data: predicts Cry genes from raw genome sequence using IDBA_UD and GeneMark, followed by hmmscan, BLAST and the ML model. Also writes the result to a final directory, plus three less detailed plain lists for easy comparison.
  • Utility: anything that might be shared by other scripts and isn't related to Machine Learning.
  • Metabolic: (formerly "testcode") all the code needed to run various tests on the metabolic model I was fiddling with (and a call to the R script that produces the output?)

Week 26 (12th)

BLAST and the hidden Markov models were able to retrieve 2 proteins in my test set that my RF model was not:
gene_5934|GeneMark.hmm|310_aa|+|3109|4041
gene_5931|GeneMark.hmm|119_aa|+|362|721

The purpose of my RF model, however, was to differentiate between proteins similar to Cry proteins and those that are not. I think I'll make a consensus sequence of all the non-Cry proteins and then see what that BLASTs against, to know which supergroups to exclude. The consensus sequence was built using hmmemit after hmmbuild on a Clustal Omega Stockholm file of the sequences. In the end I BLASTed all sequences against all Cry proteins and took the best hit, and I can see that there is a limited number of supergroups that it hits against.
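The consensus step corresponds roughly to two HMMER calls, wrapped here the same way the other bash calls in the pipeline are (file names are placeholders):

<code>
import subprocess

# Build a profile HMM from the Clustal Omega alignment (Stockholm format)...
subprocess.check_call(["hmmbuild", "non_cry.hmm", "non_cry_clustalo.sto"])

# ...then emit a majority-rule consensus sequence from that profile.
with open("non_cry_consensus.faa", "w") as out:
    subprocess.check_call(["hmmemit", "-c", "non_cry.hmm"], stdout=out)
</code>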
Also got pandas and treeinterpreter, so I can have a closer look at the Random Forest and maybe make some kind of summary.

The new Random Forest, called Unbalanced_RF, was made using all Cry proteins, the TrEMBL hits, and a number of proteins that InterProScan said definitely have non-Cry products (only 20 or so).

This new Random Forest model, however, only hits 2, so now it is too specific. Trying to see how I can open it up using treeinterpreter.
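Opening the forest up with treeinterpreter looks roughly like this (a sketch; the trained model and data are stand-ins for the real ones):

<code>
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

X, y = make_classification(n_samples=300, n_features=20, random_state=2)  # stand-in data
model = RandomForestClassifier(n_estimators=100).fit(X, y)

# Decompose predictions into a bias term plus one contribution per feature.
prediction, bias, contributions = ti.predict(model, X[:10])
mean_contribution = np.abs(contributions[:, :, 1]).mean(axis=0)
for index in mean_contribution.argsort()[::-1][:5]:
    print("feature %d: mean |contribution| %.3f" % (index, mean_contribution[index]))
</code>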

Week 27 (19th)

I have decided to let go of the false-positives issue and focus on analysing the modelling. The bin sizes don't seem to matter and can probably be removed entirely; I compared the 200 random Cry vs TrEMBL non-Cry with 0 bins, and that already had great scores.
More work on improving the entire pipeline, but also on incorporating a scoring function for the hidden Markov models.
Plotting decision boundaries has worked, with a script I found and modified.
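The decision boundary plot amounts to evaluating the classifier on a dense grid over two features and contouring the predictions; a generic sketch of the idea (not the modified script itself):

<code>
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Two informative features only, so the boundary can be drawn in 2D.
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=3)
model = RandomForestClassifier(n_estimators=100).fit(X, y)

# Predict the class for every point on a grid spanning the data.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
grid = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, grid, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()
</code>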

Week 28 (26th)

Work slowed down and writing started up.

October

Week 30

On the 18th, hours before the wiki freeze, the genome data from the Varroa isolates came in. It was rapidly annotated and examined, but detailed analysis was impossible within this time frame.