Team:UC Davis/Discovery

Cyantific: UC Davis iGEM 2016

Protein Discovery

Metagenomic mining was used to discover 13 protein sequences we believed were plausible GAF domain homologs in an effort to expand the known color range of the GAF domain. To assist in future enzymatic discovery, a predictive model was developed that outputs whether an unknown sequence is likely to reflect in the blue, red, or yellow spectrum. Three of the 13 sequences were transformed into E.coli and subsequent analysis were performed to find that our model correctly predicted all three GAF sequences would reflect above 600 nm.


Process and Motivation

Since metagenomes provide diverse and largely undiscovered species, we used the Joint Genome Institute Integrated Microbiome database to mine the metagenome samples for sequences that were similar to known blue reflecting GAF domains. This process begins by BLAST searching a known GAF domain against deposited metagenome samples. This search will return an E-value, or the expected value that this alignment will occur by random. Lower E-values mean the less likely the alignment is a false positive, because of this our team only used alignments with an E-value of 1e-50 of lower. Of the remaining alignments, the surrounding protein families, top genomic isolate hits, and metagenome sample annotation was taken into consideration before choosing which homologs to attempt transformations with. After searching more than 100 metagenomes, our team narrowed it down to 13 plausible GAF domain homolog sequences which came from the following environments:

  • Hot spring microbial communities from Yellowstone National Park, Wyoming, USA
  • Groundwater microbial communities from subsurface biofilms in sulfidic aquifer in Frasassi Gorge, Italy
  • Estuarine microbial communities from Elkhorn Slough, Monterey Bay, California, USA
  • Bacteria ecosystem from Joint Genome Institute, California, USA
  • Hypersaline mat microbial communities from Guerrero Negro, Mexico

Also, two previously untested GAF protein sequences were donated by our collaborators in the Clark Lagarias lab at UC Davis.


Although we chose to only investigate 13 sequences, there were many more we could have chosen. The amount of genomic and metagenomic data deposited in sites like JGI, NCBI and HMMER is massive and continually growing (23). Already there is 17,000 genomes deposited in NCBI(23). It’s relatively easy to find protein homologs, but it’s much harder to choose which genes to investigate. This is complicated even further if you are searching for a subgroup within one protein family, like searching for just blue reflecting GAF domains. For this reason, we developed a model to assist in choosing the GAF protein homologs based on subgroup identity.

Model development

In order to refine the GAF homologs down to the subgroup level, ideally we needed a model that could identity patterns in protein sequences which affect the subgroup identity. In searching for existing programs, we discovered over twenty programs dating back to 1996 that were designed for a similar purpose. However, many were not maintained over the years or open access. Of the 12 programs our data was tested set on, most failed to correctly separate the protein family based on subgroup. Since we were unable to find a program which fit our needs, we developed our own.

Historically, consensus sequences and pro-site patterns were used to identify influential patterns in sequences. While these methods are effective for highly conserved sequences and regions, they require arbitrary data cut-offs which makes their methods a bit coarse. One method which does not require these cut-offs are profile Hidden Markov Models, or profile HMMs. Profile HMM’s are statistical models that give the probability of any amino acid or DNA base at any position in the sequence. For example, if there is a lysine at position 2 in the sequence alignment this informs not only the probability of lysine at position 2, but the probability of other amino acids at that position. Profile HMM’s have become widely used in identifying protein families, even Pfam the protein family database used HMM’s to curate their database (24).



Specifically, the model began with about 100 previously characterized GAF protein sequences provided by Clark Lagarias’s laboratory which we then divided into blue, yellow, and red subgroups. The blue subgroup represents any GAF sequence that absorbs in the red spectra, and thus reflects and appears blue. For the purposes of this model, GAF domains absorbing from 550nm and below are considered in the yellow subgroup, from 550-600nm are considered in the red subgroup, and from from 600nm and above are considered in the blue subgroup. For every subgroup we created multiple sequence alignments that were influenced by the protein structure using Promals3D. Using the protein structure in the multiple alignment allows relatively divergent clusters to be aligned with more accuracy because it takes into account sequence-based constraints derived from the 3D structures (20) . After creating the three multiple sequence alignments, a profile HMM was created using HMMER. The model then aligns the user given sequence and compares it to all three profile HMMs.By calculating the log2 odds ratio of the observed to expected E value, a quantifiable score is produced. This model now has a web interface and is downloadable on github here.

To assess the accuracy of our model a four-fold cross validation was performed. This meaning a random quarter of the GAF sequences were removed and only the remaining 3/4ths were used for the profile HMM creation. This ensures that the model is not populated with the sequences which will be used for assessing accuracy, and thus our accuracy measurement is not biased. The reduced model constructed from just 3/4ths of the original sequences would then be queried by the randomized quarter of previously removed sequences and calculated how accurately the model could blindly predict the sequences. We repeated this three other times until all random and no overlapping quarters were used as the query.Through this validation we calculated our model has a 75% accuracy which is 42% more accurate than would occur by random chance.

Future Goals

In the future, our team would like to refine this model to have more specificity. Instead of our model predicting three subgroups, we would like to alter it to predict the wavelength range the GAF sequence will absorb at. This level of specificity would be beneficial in distinguishing more nuanced differences such as orange from yellow and blue from green.



  1. Trotter, Greg. "Food Companies Are Phasing out Artificial Dyes, but Not Fast Enough for Some." Food Companies Are Phasing out Artificial Dyes, but Not Fast Enough for Some. Chicago Tribune, 24 June 2016. Web. 18 Oct. 2016.
  2. "Konica Minolta - Products." Instrument Systems - Sensing Americans. Konica Minolta Sensing Americans Inc., Oct. 2016. Web. 18 Oct. 2016.
  3. "Our Proposition on Artificial Colors - Colors Policy." MARS Incorporated, Oct. 2016. Web. 18 Oct. 2016.
  4. "Kraft Mac & Cheese Says Goodbye to the Dye." Decision - 2016. NBC News, 20 Apr. 2015. Web. 18 Oct. 2016.
  5. "Taste of General Mills." A Big Commitment for Big G Cereal. General Mills Cereal Incorporated, 22 June 2015. Web. 18 Oct. 2016.
  6. "Nestlé USA Commits to Removing Artificial Flavors and FDA-Certified Colors from All Nestlé Chocolate Candy by the End of 2015." 150 Years of Good Food, Good Life. Nestle Incorporated, 17 Feb. 2015. Web. 18 Oct. 2016.
  7. Alexandratos, Nikos, and Jelle Bruinsma. "World Agriculture Towards 2030/2050." Agricultural Development Economics Division - Food and Agriculture Organization of the United Nations. Global Perspective Studies Team, June 2012. Web. 18 Oct. 2016.
  8. "Compound Summary for CID 19700." Brilliant Blue FCF. Pub Chem - Open Chemistry Database, 15 Oct. 2016. Web. 18 Oct. 2016.
  9. "Summary of Color Additives for Use in the United States in Foods, Drugs, Cosmetics, and Medical Devices." For Industry - Color Additives - Color Additive Inventories. U.S. Food & Drug Administration, May 2016. Web. 18 Oct. 2016.
  10. http://link.springer.com/article/10.1007/s00217-004-1062-7
  11. http://www.instructables.com/id/Blue-Foods-Colorful-cooking-without-artificial-dy/
  12. A Brief History of Phytochrome Nathan C. Rockwell and J. Clark Lagarias
  13. Hirose, Yuu, et al. "Green/red cyanobacteriochromes regulate complementary chromatic acclimation via a protochromic photocycle."Proceedings of the National Academy of Sciences 110.13 (2013): 4974-4979.
  14. Rockwell, Nathan C., and J. Clark Lagarias. "A brief history of phytochromes." ChemPhysChem 11.6 (2010): 1172-1180.
  15. Bussell, Adam N., and David M. Kehoe. “Control of a Four-Color Sensing Photoreceptor by a Two-Color Sensing Photoreceptor Reveals Complex Light Regulation in Cyanobacteria.” Proceedings of the National Academy of Sciences of the United States of America 110.31 (2013): 12834–12839. PMC. Web. 7 Oct. 2016.
  16. http://www.nutritionaloutlook.com/food-beverage/coloring-spirulina-blue
  17. Nature’s Palette: The Search for Natural Blue Colorants
  18. https://www.novozymes.com/en
  19. http://www.dsm.com/corporate/home.html
  20. http://prodata.swmed.edu/promals3d/promals3d.php
  21. https://www.neb.com/products/e6901-impact-kit
  22. Heim R, Cubitt A, Tsien R (1995)."Improved green fluorescence" (PDF). Nature. 373 (6516): 663–4.doi:10.1038/373663b0.PMID7854443.
  23. https://www.ncbi.nlm.nih.gov/genome/browse/
  24. http://pfam.xfam.org/
  25. Red/Green Cyanobacteriochromes: Sensors of Color and Power - lagarias
  26. https://en.wikipedia.org/wiki/Brilliant_Blue_FCF#/media/File:Blue_smarties.JPG
  27. Fischer, A. J. and J. C. Lagarias (2004). "Harnessing phytochrome's glowing potential." Proceedings of the National Academy of Sciences of the United States of America 101(50): 17334-17339.
  28. Stability of phycocyanin extracted from Spirulina sp.: Influence of temperature, pH and preservatives- R Chaiklahan, N Chirasuwan, B Bunnag - Process Biochemistry, 2012 - Elsevier
  29. http://www.lightboxkit.com/Assay_dye.html
  30. "Your Brain - A User's Guide - 100 Things You Never Knew." National Geographic Time INC. Specials 12 Feb. 2016: Print.
  31. Sensing, Konica Minolta. "How Color Affects Your Perception of Food." Konica Minolta Color, Light, and Display Measuring Instruments. Konica Minolta Sensing Americans Inc., 2006. Web. 10 Oct. 2016.
  32. "Center for Disease Control and Prevention - General Information Escherichia Coli." CDC 24/7: Saving Lives, Protecting People. Centers for Disease Control and Prevention, 06 Nov. 2015. Web. 10 Oct. 2016.
  33. Olmos J, Paniagua-Michel J (2014) Bacillus subtilis A Potential Probiotic Bacterium to Formulate Functional Feeds for Aquaculture. J Microb Biochem Technol 6: 361-365. doi:10.4172/1948-5948.1000169
  34. Zeigler, Daniel R. "Bacillus Genetic Stock Center." Bacillus Genetic Stock Center Website. The Bacillus Genetic Stock Center - Biological Sciences, Oct. 2016. Web. 10 Oct. 2016.
  35. Vellanoweth, Robert Luis, and Jesse C. Rabinowitz. "The Influence of Ribosome binding-sites Elements on Translational Efficiency in Bacillus Subtilis and Escherichia Coli in Vivo." Molecular Microbiology 1105-1114, 15 Jan. 1992. Web. 9 Oct. 2016.
  36. "Pveg (Plasmid #55173)." Addgene - The Nonprofit Plasmid Repository. Synthetic Biology; Bacillus BioBrick Box, Aug. 2016. Web. 10 Oct. 2016.
  37. Gambetta, Gregory A., and Clark J. Lagarias. "Genetic Engineering of Phytochrome Biosynthesis in Bacteria." Section: Molecular and Cellular Biology. Proceedings of the National Academy of Sciences of the United States of America, 19 July 2001. Web. 9 Oct. 2016.
  38. "Registry of Standard Biological Parts - B0015." Part: BBa_B0015. International Genetically Engineered Machine Competition (iGEM), 17 July 2003. Web. 9 Oct. 2016.
  39. "Bacillus Subtilis - Indiamart." IndiaMART InterMESH Ltd., n.d. Web. 9 Oct. 2016.