Protein Optimization & Gibson Assembly Primer Design
Overview
The purpose of this script is to expedite protein DNA sequence optimization and primer design for Gibson assembly reactions. The script files can be found here on GitHub or downloaded individually from the iGEM database below.
Note: because iGEM does not support the .py file extension, you will need to remove .txt from the extensions of all of the files downloaded from the links below and append ".py" to the names of (1) seq_analyzer, (2) seq_tools, (3) input_tools, and (4) format_tools (e.g. "seq_tools.txt" becomes "seq_tools.py"). No renaming is needed for the GitHub download.
- seq_analyzer
- seq_tools
- input_tools
- format_tools
- NT_Lib
- Codon_Lib
- restriction_enzymes
- ecoli
- ecoliK12
- yeast
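The renaming described in the note above can be scripted. The following is a minimal sketch, not part of the project itself: it assumes each iGEM download is named like "seq_tools.txt" or "NT_Lib.txt", and the helper names `target_name` and `rename_downloads` are made up for illustration.

```python
from pathlib import Path

# The four script modules named in the note; every other download
# only has its trailing ".txt" removed (assumed naming, see lead-in).
PYTHON_MODULES = {"seq_analyzer", "seq_tools", "input_tools", "format_tools"}


def target_name(downloaded_name: str) -> str:
    """Return the name a downloaded file should be renamed to."""
    if not downloaded_name.endswith(".txt"):
        return downloaded_name  # already in its final form
    stem = downloaded_name[: -len(".txt")]
    return stem + ".py" if stem in PYTHON_MODULES else stem


def rename_downloads(folder: str) -> None:
    """Rename every downloaded ".txt" file in `folder` in place."""
    for path in Path(folder).iterdir():
        new_name = target_name(path.name)
        if path.is_file() and new_name != path.name:
            path.rename(path.with_name(new_name))
```

Run `rename_downloads` only on a folder that holds nothing but the iGEM downloads, since any other ".txt" file in that folder (for example your own input file) would also lose its extension.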
Sample input and output files are also included in the repository and are listed below:
Input:
Output:
Once the user input files have been formatted properly, the script can be run.
Running the script
To run the program, the user must provide a text file containing the sequences in need of optimization and/or primer design. Each sequence must be on its own line; blank lines and spaces will result in a processing error. The sequences must contain only the characters representing nucleotides (A, T, G, C) or amino acids (G, A, L, M, F, W, K, Q, E, S, P, V, I, C, Y, H, R, N, D, T, *).
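Because a single stray space or blank line causes a processing error, it can help to check the input file before running the script. The following is a rough sketch of such a check based only on the rules stated above; `find_input_problems` is a hypothetical helper, not a function from the script, and treating lower-case letters as invalid is an assumption.

```python
NUCLEOTIDES = set("ATGC")
AMINO_ACIDS = set("GALMFWKQESPVICYHRNDT*")
ALLOWED = NUCLEOTIDES | AMINO_ACIDS


def find_input_problems(lines):
    """Return (line_number, message) pairs for lines that break the format rules."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line == "":
            problems.append((number, "blank line"))
            continue
        if any(ch.isspace() for ch in line):
            problems.append((number, "contains whitespace"))
        # Anything that is neither whitespace nor an allowed letter
        # (assumption: upper-case only).
        bad = sorted(ch for ch in set(line) if not ch.isspace() and ch not in ALLOWED)
        if bad:
            problems.append((number, "unexpected characters: " + "".join(bad)))
    return problems
```

A typical use would be `find_input_problems(open("my_sequences.txt"))`, where the file name is a placeholder; an empty result means the file satisfies the rules listed above.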
For protein optimization, the input file may contain amino acid sequences, nucleotide sequences, or both: the program recognizes the sequence type and processes it accordingly. The exception is an amino acid sequence made up only of alanine (A), threonine (T), cysteine (C), and glycine (G), since these single-letter amino acid codes are also the single-letter nucleotide codes. In these cases the program cannot distinguish an amino acid sequence from a nucleotide sequence; this can be corrected by putting a "*" at the end of the amino acid sequence.
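The ambiguity described above follows from how a type check based on the alphabet has to work. The sketch below is an illustration of that reasoning, not the detection code actually used in seq_analyzer; `classify_sequence` is a hypothetical name.

```python
NUCLEOTIDES = set("ATGC")


def classify_sequence(seq):
    """Guess the sequence type: "nt" for nucleotide, "aa" for amino acid."""
    # A sequence made only of A, T, G, C is indistinguishable from DNA,
    # so it is treated as nucleotide. Any other letter, or a "*",
    # takes it outside the nucleotide alphabet and marks it as amino acid.
    if set(seq) <= NUCLEOTIDES:
        return "nt"
    return "aa"
```

Under this sketch, `classify_sequence("GATTACA")` is read as nucleotide even if a peptide was intended, while `classify_sequence("GATTACA*")` is read as amino acid, which is why appending "*" resolves the ambiguity.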
For Gibson primer design, the sequences in the input file should be listed in the following order:
- First sequence contains at least 50 nt of the 3' end of the backbone, to which the 5' end of the first fragment will join
- N sequences to be assembled ...
- Last sequence contains the 5' end of the backbone, to which the 3' end of the last fragment will join
This is illustrated in the following figure, where the first sequence in the file is "Backbone front", the last sequence is "Backbone rear", and the middle sequences in the file are the N fragments listed in order of assembly.
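The file ordering described above can be sketched in a few lines. This is an illustration under stated assumptions, not the script's own parsing code: `split_assembly_input` and `junction_pairs` are hypothetical names, and the 50 nt check is applied only to the first sequence because that is the only length the text specifies.

```python
MIN_BACKBONE_FRONT = 50  # nt, from "at least 50 nt" for the first sequence


def split_assembly_input(sequences):
    """Split the ordered sequences into (backbone_front, fragments, backbone_rear)."""
    if len(sequences) < 3:
        raise ValueError("need backbone front, at least one fragment, and backbone rear")
    front, *fragments, rear = sequences
    if len(front) < MIN_BACKBONE_FRONT:
        raise ValueError("backbone front must contain at least 50 nt")
    return front, fragments, rear


def junction_pairs(sequences):
    """Return index pairs of neighbouring sequences, i.e. the joins in the assembly."""
    return [(i, i + 1) for i in range(len(sequences) - 1)]
```

For a file with two fragments, `junction_pairs` returns three joins: backbone front to fragment 1, fragment 1 to fragment 2, and fragment 2 to backbone rear. Listing the fragments in any other order changes which ends are treated as neighbours.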