Overview
Background
The increased use of non-conventional expression hosts, as proposed in this year's DTU-Denmark project, increases the need for codon optimization of coding sequences used for heterologous protein production. Codon optimization is typically performed by replacing each codon with the synonymous codon most frequently observed in the host genome. However, the most frequent synonymous codon is not necessarily the most efficiently translated one, because a typical or average transcript is unlikely to have a higher-than-average translation efficiency. Highly translated transcripts often contain “reserved” codons that are not the most common but that best match the tRNA pools in the cell. These tRNA pools can be estimated from the number of tRNA genes whose anticodon can decode a given codon. When used to assess the translation efficiency of a coding sequence, this approach is known as the tRNA Adaptation Index (TAI) (dos Reis et al. 2004).
From Blackboard Calculations to Software
TaiCO (TAI Codon Optimization tool) is a computational tool for codon optimization. To the best of the team's knowledge, it is the only stand-alone application with a graphical user interface (GUI) that is based entirely on species-specific TAI calculation. The DTU-Denmark team hopes that its simplicity will help users obtain biotechnological results faster and more easily, and that its implementation of the underlying theory will make it a high-end optimization method.
Theory implemented algorithmically
As you may already know, the central issue in codon optimization is to determine which codons are most efficiently translated for each amino acid. The quantity needed for this task is called 'translatability' and is denoted \(W_i\) for the \(i\)'th codon.
To accomplish this, we have chosen to use a tRNA Adaptation Index-based method (tAI) (dos Reis et al. 2004). The fundamental assumption behind this method is that highly expressed proteins have their genes encoded with a set of codons that is overall more susceptible to tRNA binding and translation than that of less expressed proteins. Hence, the method estimates the codon preferences such that the correlation between protein level and tAI is maximized.
The formulas for calculating this are stated in Table 1 of dos Reis et al. (2004). Using these, all 64 \(W_i\)'s can be calculated in one matrix multiplication by letting \(G\) be the 4\(\times\)16 matrix of tGCNs (referred to as 'gcn' in TaiCO) and letting \(S\) be the 4\(\times\)4 matrix containing the (1 \(-s_{ij}\)) values. Hence,
$$W = SG$$
The computed \(W_i\)'s are then normalized by setting \(w_i = \frac{W_i}{W_{\text{max}}}\), and these normalized translatabilities \(w_i\) form the basis for codon selection: for each amino acid, the synonymous codon with the highest \(w_i\) is selected over those with lower values. This concludes the method for codon selection.
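The normalization and selection step can be sketched in a few lines of Python. This is a minimal illustration using only built-in data structures, not TaiCO's actual code: the raw \(W_i\) numbers, the two-amino-acid codon table and the function names `normalize`, `best_codon` and `optimize` are all assumptions made for the example.

```python
# Hypothetical raw translatabilities W_i (illustrative numbers, not real data).
RAW_W = {"GAA": 14.0, "GAG": 6.48, "AAA": 7.0, "AAG": 16.24}

# Synonymous codons for the two amino acids used in this sketch
# (E = glutamic acid, K = lysine).
SYNONYMS = {"E": ["GAA", "GAG"], "K": ["AAA", "AAG"]}


def normalize(raw_w):
    """Return w_i = W_i / W_max for every codon."""
    w_max = max(raw_w.values())
    return {codon: value / w_max for codon, value in raw_w.items()}


def best_codon(amino_acid, w):
    """Pick the synonymous codon with the highest normalized translatability."""
    return max(SYNONYMS[amino_acid], key=lambda codon: w[codon])


def optimize(protein, w):
    """Back-translate a protein by choosing the best codon for each residue."""
    return "".join(best_codon(aa, w) for aa in protein)


w = normalize(RAW_W)
optimized = optimize("EKE", w)
print(optimized)
```

With these made-up values the codon with the largest \(W_i\) gets \(w_i = 1\), and every other codon is scaled relative to it.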
The \(G\) matrix
\(G\) consists of 64 tGCN values, which are the gene copy numbers of the tRNAs recognizing specific codons. Available gcn files normally list the tGCNs by anticodon, which is the reverse complement of the recognized codon; to obtain the codon, each triplet in the raw gcn file is therefore reversed and its bases are replaced by the complementary ones. For instance, in S. cerevisiae the gcn of the tRNAs recognizing TTC (encoding phenylalanine) is 10, so the raw file lists this value under the anticodon GAA instead. Once converted into codon form, the tGCNs are placed in the \(G\) matrix such that each column has the first two positions fixed and each row has a fixed third position:
AAA | ACA | AGA | ATA | CAA | CCA | CGA | CTA | GAA | GCA | GGA | GTA | TAA | TCA | TGA | TTA |
AAC | ACC | AGC | ATC | CAC | CCC | CGC | CTC | GAC | GCC | GGC | GTC | TAC | TCC | TGC | TTC |
AAG | ACG | AGG | ATG | CAG | CCG | CGG | CTG | GAG | GCG | GGG | GTG | TAG | TCG | TGG | TTG |
AAT | ACT | AGT | ATT | CAT | CCT | CGT | CTT | GAT | GCT | GGT | GTT | TAT | TCT | TGT | TTT |
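The conversion from anticodon to codon and the layout of \(G\) can be sketched as follows. This is an illustration under stated assumptions rather than TaiCO's implementation: the gcn data is passed in as a plain dictionary (the real gcn file format is not reproduced here), missing entries are assumed to be 0, and the helper names `anticodon_to_codon` and `build_g` are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def anticodon_to_codon(anticodon):
    """Reverse-complement an anticodon to get the codon it pairs with."""
    return "".join(COMPLEMENT[base] for base in reversed(anticodon))


def build_g(gcn_by_anticodon):
    """Build the 4x16 G matrix: one row per third base, one column per
    pair of first two bases, both in A, C, G, T order."""
    gcn_by_codon = {anticodon_to_codon(a): n for a, n in gcn_by_anticodon.items()}
    prefixes = [first + second for first in BASES for second in BASES]
    # Codons without an entry are assumed to have a gene copy number of 0.
    return [[gcn_by_codon.get(prefix + third, 0) for prefix in prefixes]
            for third in BASES]


# The S. cerevisiae example from the text: anticodon GAA has a gcn of 10.
g = build_g({"GAA": 10})
print(anticodon_to_codon("GAA"), g[1][15])
```

Here the value 10 ends up in the row for third base C and the last column (prefix TT), matching the position of TTC in the table above.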
The \(S\) matrix
While \(G\) is precisely known, \(S\) needs to be optimized. In dos Reis et al. (2004), the optimized \(s_{ij}\)-values for S. cerevisiae are published, yielding the \(S\)-matrix, $$ S = \begin{pmatrix} 1 & 0 & 0 & 0.0001 \\ 0 & 1 & 0 & 0.72 \\ 0.32 & 0 & 1 & 0 \\ 0 & 0.59 & 0 & 1 \end{pmatrix} $$ where both rows and columns are ordered as A,C,G,T. Thus, the \(W_i\)'s computed from the \(SG\) multiplication are each influenced by two tGCNs. As an example, the translatability of CCG is equal to the dot product of the third row of \(S\) (because the third position is a G) and the sixth column of \(G\) (because the first two positions are CC): $$ W_{CCG} = 0.32 \cdot \text{tGCN}_{CCA} + 1 \cdot \text{tGCN}_{CCG} $$ which takes the wobbling potential of G to A in the third position into account.
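The \(W = SG\) product and the CCG example can be checked with a short pure-Python sketch. The \(S\) values are the ones quoted above; the tGCN numbers placed in the CC column are made up for illustration, and the `matmul` helper is a hypothetical name, not a TaiCO function.

```python
# (1 - s_ij) values quoted in the text; rows and columns ordered A, C, G, T.
S = [
    [1.0, 0.0, 0.0, 0.0001],
    [0.0, 1.0, 0.0, 0.72],
    [0.32, 0.0, 1.0, 0.0],
    [0.0, 0.59, 0.0, 1.0],
]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# G is 4x16: rows = third base (A, C, G, T), columns = first two bases
# (AA, AC, AG, AT, CA, CC, ...). Only the CC column (index 5) is filled,
# with made-up values: tGCN_CCA = 10 and tGCN_CCG = 2.
G = [[0] * 16 for _ in range(4)]
CC = 5
G[0][CC] = 10  # CCA
G[2][CC] = 2   # CCG

W = matmul(S, G)
w_ccg = W[2][CC]  # 0.32 * tGCN_CCA + 1 * tGCN_CCG
print(round(w_ccg, 4))
```

With these assumed inputs the entry for CCG combines both tGCNs exactly as in the formula above, while the entry for CCA is determined by its own tGCN alone because the CCT entry is zero.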
TaiCO features
The DTU-Denmark team's proposal for reliable and fast computational production of optimized DNA sequences is named TaiCO. The need for a specialized software tool for optimizing Y. lipolytica DNA sequences became evident when the team's "product subgroup" started to design constructs for protein expression. Thanks to its simple architecture and low resource demands, TaiCO allowed the team's biotechnologists to perform extended analyses of the coding sequences of interest.
Software overview
TaiCO is implemented in Python 3. The source code is laid out to be easily modifiable: it relies exclusively on built-in libraries and modules and on commonly used Python data structures. The software is released under the open-source GPL v3 license. For a more detailed view of how the algorithm was implemented, readers are encouraged to inspect the source code along with the README.txt file deposited in the iGEM software GitHub repository.
Input files and result
The first input file requested by TaiCO is a GCN table in simple text format. The software comes bundled with 7 GCN files from model organisms, so the user can choose the target organism that will be used in the wet lab; alternatively, the user can upload a GCN table from an organism not included in the bundle. The second input file, which the user has to provide, is a FASTA-format list of one or more protein sequences to be optimized. The third input file is optional but powerful: a simple text file listing the sequences of the restriction sites that must be absent from the optimized DNA sequences. The output of the analysis is a FASTA-format file containing all the optimized DNA sequences.
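To make the file handling concrete, the sketch below parses FASTA-formatted text and reports which restriction sites occur in a DNA sequence. It is only an illustration of the idea: the function names `read_fasta` and `find_sites` are hypothetical, the example record is invented, and how TaiCO itself reads its inputs or removes unwanted sites is documented in its source code and README.txt, not here.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so append to the current record.
            records[header] += line.upper()
    return records


def find_sites(dna, sites):
    """Return the restriction sites that occur in the given DNA strand."""
    return [site for site in sites if site in dna]


# Invented example: one record containing an EcoRI site (GAATTC).
records = read_fasta(">demo\nGAATTC\naag\n")
hits = find_sites(records["demo"], ["GAATTC", "GGATCC"])
print(records, hits)
```

This check only scans the strand as written; for non-palindromic sites a fuller check would also need to scan the reverse complement.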
Compatibility, runtime, distribution
The full script was converted into an executable file, along with all included modules, using the PyInstaller (2) software. This allowed us to make TaiCO available for all mainstream platforms (Unix-based systems, Windows, macOS). Because of the way PyInstaller packages the software, the only mandatory step for running it is to download the preferred zipped version, which is stored in iGEM's software repository on GitHub and contains all the files essential for proper use of the tool. The relevant operating system version of TaiCO can be downloaded by clicking one of the following links: Windows: Unix: MacOS: For further information regarding terms of use and how to use the tool properly, you are strongly advised to inspect the README.txt file or contact the author by email: vrantos@hotmail.gr