We began by searching UniProtKB/Swiss-Prot (http://www.uniprot.org/), the manually annotated and reviewed section of UniProtKB, a freely accessible database of protein sequence and functional information. With the query “insecticidal NOT crystal” we aimed to retrieve all proteins with insecticidal activity while excluding the crystal proteins of Bacillus thuringiensis. The search returned 216 proteins.
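The same keyword search could also be expressed as a URL rather than typed into the web page. The following is only a minimal sketch: the endpoint and the `query`, `format` and `size` parameters are assumptions based on the present-day UniProt REST interface, `build_search_url` is a hypothetical helper, and none of this is taken from our crawler. A search run today may also return a different number of proteins than the 216 reported here.

```python
from urllib.parse import urlencode

# Assumed endpoint of the current UniProt REST interface (not from the original work).
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(keywords, reviewed_only=True, fmt="tsv", size=500):
    """Build a search URL for a keyword query (hypothetical helper)."""
    # Parenthesise the keywords so that NOT binds inside the user query.
    query = "({})".format(keywords)
    if reviewed_only:
        # Restrict the search to the reviewed (Swiss-Prot) section.
        query += " AND reviewed:true"
    params = {"query": query, "format": fmt, "size": size}
    return UNIPROT_SEARCH + "?" + urlencode(params)


url = build_search_url("insecticidal NOT crystal")
print(url)
```

The function only builds the URL and does not contact the server, so it can be inspected before any request is sent.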
From these results, we built our Pantide database by crawling 11 fields of protein information from UniProt. The fields are as follows.
- The name of the protein
- The description of protein function
- The organisms/source of the protein sequence
- The length of the protein in amino acids
- The number of disulfide bonds
- Propeptide & signal peptide: if a protein has an N-terminal signal peptide and propeptide, that part of the protein is cleaved during maturation or activation.
- UniProt entry & ArachnoServer id: the accession numbers of the protein in UniProtKB and ArachnoServer*.
*ArachnoServer is a manually curated database of protein toxins derived from spider venom (http://www.arachnoserver.org/).
We also crawled seven further fields on protein toxicity recorded by ArachnoServer: molecular target, taxon, ED50, LD50, PD50, qualitative information, and the protein sequence. The molecular target is the site of action of the toxin peptide, such as voltage-gated ion channels or GABA receptors. Taxon, ED50, LD50, PD50, and the qualitative information describe the toxicity measured experimentally against the tested taxon. The protein sequences in the two databases are identical.
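To make the set of fields concrete, the sketch below shows one way they could be stored with the sqlite3 module. The table name, the column names and the `create_db` and `upsert` helpers are all hypothetical and chosen for illustration; the actual schema used by our crawler may differ. For simplicity it keeps one row per protein, although a protein tested against several taxa would more naturally need a separate toxicity table.

```python
import sqlite3

# Hypothetical schema: column names are illustrative, not the crawler's own.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pantide (
    uniprot_entry     TEXT PRIMARY KEY,  -- UniProtKB accession
    arachnoserver_id  TEXT,              -- NULL if absent from ArachnoServer
    name              TEXT NOT NULL,
    function          TEXT,
    organism          TEXT,
    length            INTEGER,           -- number of amino acids
    disulfide_bonds   INTEGER,
    signal_peptide    TEXT,
    propeptide        TEXT,
    molecular_target  TEXT,
    taxon             TEXT,
    ed50              TEXT,              -- kept as text because values carry units
    ld50              TEXT,
    pd50              TEXT,
    qualitative_info  TEXT,
    sequence          TEXT
);
"""


def create_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def upsert(conn, record):
    """Insert or replace one protein; `record` maps column names to values.

    Column names are interpolated into the SQL, so the keys must come from
    trusted code, never from crawled text. Values are bound as parameters.
    """
    columns = ", ".join(record)
    placeholders = ", ".join(":" + key for key in record)
    conn.execute(
        "INSERT OR REPLACE INTO pantide ({}) VALUES ({})".format(columns, placeholders),
        record,
    )
    conn.commit()
```

Using the UniProt accession as the primary key means that crawling the same protein twice overwrites the earlier row instead of duplicating it.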
We developed our crawler in Python 3.5 using BeautifulSoup 4.4.0 together with the sqlite3 and gevent modules. The code is available on GitHub.
(Link: https://github.com/chengchingwen/iGEM/blob/master/crawler.py)
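The general shape of such a crawler (fetch many entry pages concurrently, then pull named fields out of each page) can be sketched as follows. This is not the code in the repository above. To stay within the standard library, it swaps in `html.parser` for BeautifulSoup and a thread pool for gevent. The element ids (`name`, `organism`, `length`) and the `fetch` callable are hypothetical, and real UniProt or ArachnoServer pages are structured differently.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect the text of elements whose id attribute is in `wanted`.

    Simplification: assumes the wanted elements contain plain text only,
    with no nested tags.
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # id of the element currently being read
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.wanted:
            self.current = element_id
            self.fields[element_id] = ""

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data.strip()


def parse_entry(html, wanted=("name", "organism", "length")):
    """Return a dict of the wanted fields found in one page."""
    parser = FieldParser(wanted)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.fields


def crawl(accessions, fetch, workers=8):
    """Fetch pages concurrently and parse each one.

    `accessions` is a list of ids; `fetch` is any callable that maps an
    id to an HTML string (for example, a wrapper around an HTTP request).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = pool.map(fetch, accessions)  # results come back in input order
        return {acc: parse_entry(page) for acc, page in zip(accessions, pages)}
```

Passing `fetch` in as a parameter keeps the network access separate from the parsing, so the parser can be exercised on saved pages without contacting either database.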