Team:Stanford-Brown/SB16 Software


Stanford-Brown 2016

Protein Optimization & Gibson Assembly Primer Design

Overview

The purpose of this script is to expedite the process of protein DNA sequence optimization and Gibson Primer design for Gibson assembly reactions. The files for the script can be found either here on github or be downloaded individually from the IGEM database below.

Note because IGEM does not support .py file extensions, you will need to remove .txt from the extensions of all of the downloaded files from the links below and append ".py" to the end of (1) seq_analyzer, (2) seq_tools, (3) input_tools, and (4) format_tools (i.e. "seq_tools.txt" becomes "seq_tools.py.") No further formatting is needed for the github link download.

Input File Formatting

In order to run the program, the user has to provide a text file containing the sequences in need of optimization and/or primer design. Each sequence should be included on a separate line; any blank newline entries and spaces will result in a processing error. The sequences should also contain no other characters other than that representing nucleotides (A,T,G,C) or amino acids (G,A,L,M,F,W,K,Q,E,S,P,V,I,C,Y,H,R,N,D,T,*).

For protein optimization, sequences in the input file are not limited to only amino acid or nucleotide sequences--the user can input either and the program will recognize the sequence type and process it accordingly. This is limited however in cases where the only amino acids in the sequence are alanine, threonine, cysteine, and glycine, since the single letter code for each amino acid is also found in the single letter nucleotide representations. For these cases, the program will be unable to distinguish between an amino acid or nucloetide sequence--this however can be corrected by putting a "*" at the end of an amino acid sequence.
  1. First sequence contains at least 50nt of the 3' end of the backbone where the 5' end of the first fragment will join to
  2. N number of sequences to be assembled ...
  3. Last sequence contains the 5' end of the backbone where the the 3' end of the last fragment will join to This is illustrated in the following figure, where the first sequence in the file is "Backbone front", last sequence is "Backbone rear", and the middle sequences in the file would be the N fragments listed in order of assembly.


Sample input and output files are also included in the repository and are listed below:

Input:

Output:

Once the user input files have been formatted properly, the script can be run.

Running the Program

Before the script can be run, make sure you have downloaded the following files to a directory of your choosing:
  • seq_analyzer.py
  • seq_tools.py
  • input_tools.py
  • format_tools.py
  • Codon_Lib
  • NT_Lib
  • restriction_enzymes

These files contain the functions needed by the main script to run the algorithms, and also contain libraries for the program to read from that contain codon and nucleotide pairing maps, and restriction site sequences.

Additionally, the user should download

which are codon frequency use tables for their respective organisms. If a different codon table is desired, the user can create a textfile with each row (codon) arranged as such:

3nt_codon, \t, single_letter_aminoacid_abbreviation, \t, frequency, #/1000


Example:


GCT A 0.16 15.34
GCC A 0.27 25.51
GCA A 0.21 20.28
GCG A 0.36 33.66 …


To run the script, you will need to open terminal or command prompt on your computer. For windows, press Windows+R, and then type “cmd” into the run bar, and hit enter. On mac, hit Command+Space, type “terminal” and then hit enter. This should open a command line interface in which the program can be run.

Commands are input into the program in the following format:

python seq_analyzer.py -if sequence_file -m (g)ibson/(p)rotoptimization -c organism_codon_file -o output_file -t speed
			
in which the arguments are:
  • -if = input filename, note the input file should be in the same directory
  • -m = mode, p for protein optimization, and g for gibson primer design
  • -c = organism codon frequency usage
  • -o = output filename
  • -t = speed at which you want to run the program–a larger number is 0.

The required arguments are -if, -m, and -c. If no output filename is designated, the output file will be named after the input file with “.output” concatenated on the end. if no speed is designated, the default speed is 1 second between each step.

Example:

python seq_analyzer.py -if sequence_file -m g -c ecoli -o output -t 10

python seq_analyzer.py -if sequence_file -m g -c ecoli 10
Both commands above do the same thing, except the first command designates an output file called output and makes the operation run at 10 seconds between each step.

The script will then run and depending on the operation selected ask for appropriate user input and at the end output a text file with the results.

Sequence processing

When running the protein optimization module, the program will process each sequence individually. For each sequence, the program will convert the sequence to its codon optimized DNA form (for amino acid sequences, it will convert it directly to an optimized DNA sequence; for DNA sequences, it will convert it into an amino acid sequence to verify length, and then conver it to an optimized DNA sequence).

Each sequence will then be scanned for restriction sites–if a restriction site is found, the program will isolate that area, and iterate over optimal codons until it finds an iteration that does not contain any restriction sites. This newly optimized area will then replace the original code.

When running the gibson optimization module, the program will ask the user for a melting temperature and a salt concentration. The recommended gibson assembly temperature is 48C from NEB, while the standard salt concentration for a gibson reaction is around 0.05M, or 50mM. The user can tune either parameter to their specifications, however, note a high salt concentration and/or low melting temperature input can result in primers being more than several degrees apart in melting temperature. The melting temperature of the primer is calculated via these formulas.

The script will then generate the junction between the fragments to calculate the optimal homology overlap and primer extension to match the temperature and salt requirement. The process of gibson primer design is indicated in the figure below, adapted from NEB. These will then be output in a text file in a readable format.

alt text

Output Files

Sample output files are included for Protein Optimizationand Gibson processeson github, and linked in iGEM under "Input File Formatting".

If there are any questions and/or comments concerning the program and its usage, please let us know!

Thanks!