To demonstrate the utility of our algorithm for a broad range of applications and across diverse gene types, we wanted to assess how many mutagenic sites our algorithm is able to eliminate. For the most comprehensive assessment possible, we calculated statistics for the number of sites in every valid open reading frame (ORF) on the Registry of Standard Biological Parts.
Using software that we wrote last year to wrap the Registry’s API, we extracted sequences and classifiers for every part on the Registry, then filtered those sequences with a heuristic to identify genes containing an ORF. In all, we characterized 2,594 genes. With a set of automated scripts, we calculated for each of these genes the number of mutagenic sites for irradiation, oxidation, alkylation, base-runs, and microhomologies in the original part sequence, then calculated how many of these sites would remain after optimization through our algorithm, both for each type of mutation individually and for overall optimization. We also report summary statistics on the percentage of sites remaining after directed optimization.
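The per-gene counting step can be illustrated with a minimal Python sketch. It is not the scripts used for the Registry analysis: the function names are hypothetical, a base-run is assumed to be a homopolymer of at least `min_len` bases, and a potential pyrimidine dimer site is assumed to be any pair of adjacent pyrimidines, without the mutagenicity weighting applied in our statistics.

```python
import re


def count_base_runs(seq, min_len=5):
    """Count homopolymer runs of at least min_len identical bases.

    Assumption: the min_len threshold is illustrative only.
    """
    # One base followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return sum(1 for _ in pattern.finditer(seq.upper()))


def count_pyrimidine_dimer_sites(seq):
    """Count adjacent pyrimidine pairs (TT, TC, CT, CC), unweighted.

    Assumption: every adjacent pair counts equally as one site.
    """
    seq = seq.upper()
    return sum(1 for a, b in zip(seq, seq[1:]) if a in "CT" and b in "CT")


def percent_remaining(before, after):
    """Percentage of sites still present after optimization."""
    if before == 0:
        return 0.0  # nothing to remove, so nothing remains
    return 100.0 * after / before
```

Calling `percent_remaining(count_base_runs(original), count_base_runs(optimized))` on a pair of sequences gives the kind of per-gene figure summarized in the statistics.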
With this tool, we have expanded the characterization of nearly every gene on the iGEM Registry by adding valuable information on each part’s predicted relative risk of mutation. This data set, which we have made publicly available, is a useful resource for anyone using the Registry for gene sequences in their projects, especially those concerned about the stability, longevity, and safety of their system.
Statistics on Optimization
With the resource we created for characterizing mutation sites across the Registry, we also compiled aggregate statistics to show how well the optimization performs on real gene sequences. As shown in the series of graphs below, mutation types differ greatly in how well they can be counteracted. For example, on average 37% of pyrimidine dimer sites (weighted by their mutagenicity) can be removed through synonymous substitutions, while 58% of oxidation sites and 50% of alkylation sites can be eliminated. Base-runs, which increase rates of polymerase error, can also be removed at a high rate (80% on average, with the most common outcome being removal of ~100% of runs over 5 bp).
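To make the idea of removing a base-run through a synonymous substitution concrete, here is a small hand-worked Python example. It is a sketch only: the codon table is deliberately partial and covers just the codons used here, the helper names are hypothetical, and the substitution is chosen by hand rather than by our optimization algorithm, which also accounts for codon usage.

```python
# Hypothetical, partial codon table: only the codons needed for this example.
CODON_TO_AA = {"AAA": "K", "AAG": "K", "GGA": "G"}


def translate(seq):
    """Translate an in-frame DNA sequence using the partial table above."""
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3))


def longest_run(seq):
    """Length of the longest stretch of identical consecutive bases."""
    longest = current = 0
    previous = None
    for base in seq:
        current = current + 1 if base == previous else 1
        previous = base
        longest = max(longest, current)
    return longest


original = "AAAAAAGGA"  # Lys-Lys-Gly, contains a run of six A's
recoded = "AAGAAAGGA"   # first Lys codon changed from AAA to AAG

print(translate(original), longest_run(original))
print(translate(recoded), longest_run(recoded))
```

Both sequences encode the same peptide, but the single A-to-G change in the third position of the first codon splits the six-base run.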
In addition, we benchmarked the software’s performance when handling these sequences as a realistic use case. For a sequence of approximately 1000 bp, an optimization specific to one mutation type completes in 0.8 to 1.1 seconds on an Intel i7 processor. An overall optimization that checks every mutation type takes approximately 3 seconds. These speeds easily accommodate most lab workflows, and there is considerable room for further speed improvement (for example, compiling the program with Cython) if larger data sets are being used.
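A timing measurement of this kind can be reproduced with a simple harness such as the Python sketch below. The `placeholder_optimize` function is a hypothetical stand-in, not our optimizer, so the numbers it prints say nothing about the figures reported above; swap in the real optimization call to benchmark it.

```python
import random
import time


def random_sequence(length, seed=0):
    """Generate a reproducible random DNA sequence of the given length."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def placeholder_optimize(seq):
    """Hypothetical stand-in for the optimizer; replace with the real call."""
    return seq.upper()


def time_call(func, arg, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        best = min(best, time.perf_counter() - start)
    return best


seq = random_sequence(1000)  # roughly the sequence length benchmarked above
elapsed = time_call(placeholder_optimize, seq)
print(f"best of 5 runs: {elapsed:.6f} s")
```

Taking the best of several runs reduces the influence of other processes on the measurement.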
Conclusions
Our software scored well on both performance metrics in our testing: how well it optimizes sequences and how computationally efficiently it does so. Our software can eliminate high percentages of mutagenic sites with only synonymous changes that retain efficient codon usage, and at speeds fast enough to accommodate bioinformatics analyses such as the comprehensive gene characterization we performed on the Registry of Standard Biological Parts. In the process, we generated a large amount of new data characterizing the mutation risk of thousands of genes, encompassing most ORFs on the Registry, which is a resource for future researchers.