Investigate Mutation Rates by High-throughput Sequencing

Determine Mutation by High-throughput Sequencing

After showing mutation by reversion, we wanted to analyze our mutation parts by next-generation sequencing (NGS) in more detail. NGS presents a method to measure mutation rates and the mutation spectrum with extreme precision. Examples for this application are the determination of E. coli (Lee et al. 2012), Yeast (Zhu et al. 2014) and Human (Xue et al. 2009) mutational rates and spectrum.
We used Illumina sequencing (MiSeq) to correctly asses the mutation frequency and spectrum of the error prone polymerase I and our genome wide mutator. Furthermore, we used the obtained data to quantify plasmid copy numbers and ratios between plasmids with the same origin of replication inside one cell. Different plasmids were analyzed by re-sequencing and de novo assembly, respectively.
Each sample was prepared as a separate library using specific tags in the adapters for multiplexing. At first, mutations within the initial plasmid were identified. Moreover, we distinguished between sequencing errors and true mutations by including a control.

Library generation, sequencing and initial computation

We used plasmids from one of our reversion experiments as template for this sequencing experiment. Out of each sample a sequencing library was constructed prepared using the Illumina Nextera DNA Library Preparation Kit. DNA fragments with a length between 500-1,000 bp were isolated. Subtracting the length of both terminal adapter, the DNA fragments contain 360 to 860 bp fragments derived from our plasmids.
Figure 1: Preperated MiSeq library analyzed with the Agilent Bioanalyzer 2100. The results show a fragment length between 500 and 1,000 bp.
The sequencing revealed 2x300 nt (paired-end) reads. Reads were analyzed with FastQC showing ~10,000 reads per library with an average read length of 300 nt (Figure). The average quality distribution of the reads (Figure) was continuously very high until base 250.
Figure 2: Figure: Distribution of read length of the library and quality distribution of the reads pre trimming analyzed with FastQC.
To obtain only high quality reads we trimmed the reads with Trimmomatic (Bolger et al. 2014) by cutting off after four consecutive bases with a quality score lower than 25. The trimmed library was analyzed as well, consisting of X reads with an average length of 220 bp (Figure) and end-to-end quality higher than 25 (Figure). A quality score over 25 means an error probability of 0.0032 (Illumina 2014). The quality score of the whole trimmed library is around 38, resulting in an error probability of 0.00016. Assuming a library size of 2.5×106 bp this leads to about 400 sequencing errors.
Figure 3: Distribution of read length of the library and quality distribution of the reads after trimming analyzed with FastQC.
The reads were mapped with CLC genomics workbench 9.5 to our plasmid references. The coverage and base distribution per position was calculated. However, it was not possible to assign the reads originating from the origin of replication initiation from the two different plasmids in some samples correctly, because both plasmid contain a very similar sequence. This region was excluded for further analysis involving the read coverage depth. As foundation for further analysis TSV files containing the information about the mapping at each position of the reference were exported from CLC Genomics Workbench. TSV files are TAB-delimited text files containing the number of supporting reads for each base at each position (Figure 4).
Figure 4: Structure of generated TSV files.

Determination of mutation rate through NGS sequencing results

We used plasmid combinations according to our reversion experiments as reference for the mapping. These experiments consist of one plasmid carrying the mutagenesis genes (either EP-PolI one of the genome wide mutator systems) and one reporter plasmid with a stop beta-lactamase. Therefore, both plasmids were present in our sequencing library. As a control for measuring sequencing errors our reporter plasmid was sequenced in parallel in a separate library.

Error prone polymerase I

The plasmids for the sequencing libraries were made by transforming 10 ng of the stop beta-lactamase plasmid pLA230 in JS200 cells, carrying the error prone polymerase I. Afterwards, the cells were grown for 24 h at 37 °C in prewarmed LB-broth media supplemented with chloramphenicol and kanamycin. Finally, the cells were plated for the reversion assays or the plasmids were isolated for sequencing. Alongside the experiment was performed using wild type polymerase I instead of the error-prone polymerase I.
Figure 5: Coverage Plots for pLA230 and pHSG-EPPolI. Coverage per base for pLA230 and pHSG-EPPolI were depicted. Important plasmid features are shown as arrows
Table 1: Average coverage of the single plasmids in the plasmid mixes.
pLA230 pHSG-PolI ratio (pLA230/pHSG-PolI)
WT 575.26 98.98 5.8
EP 675.33 141.13 4.8
The average coverage-ratio between pLA230 and pHSG-PolI was 4.8 for the error-prone polymerase and 5.8 for the wild-type polymerase. This confirms values from previously published works that used the error-prone polymerase inside the polymerase I temperature-sensitive E. coli JS200 and observed a decrease in copy number of ColE1 plasmids (Camps et al. 2003). However, the ratio in our results is lower than the published value (Camps, 2003). Furthermore, it is interesting that the published copy number of ColE1 plasmids ranging from 100 to 300 and copy numbers of pSC101-based plasmids, like pHSG, are only between 6 and 8 (Kornberg und Baker 1992; Cabello et al. 1976; Hasunuma und Sekiguchi 1977). Based on this data, we expected ratios between 12.5 and 50. The observed difference between our sequencing based copy number estimation and the previously published values, could be explained by the application of different techniques. Assuming the copy number of the controlled regulation of pSC101 plasmid is correct (6-8) this results in copy numbers for the ColE1 plasmid of 34-46 for the wild type and 29-38 for the error prone polymerase I. In our opinion the determination of plasmid ratios via NGS is more reliable than older methods. Therefore, we will use our determined value of 40 copies with a normal polymerase and 33 copies with the error prone polymerase I as values for our calculations in the reversion assay article.
To assess the mutations incorporated by the error prone polymerase I we wrote a custom python script which
  • Imports a TSV file
  • Counts coverage for all non-reference bases at every positions as well as the total coverage
  • Normalizes the counted mutations by the total coverage of the analyzed plasmid
  • Constructs mutation and coverage plots for all sequences of interest
This script calculates the mutation frequency as mutation frequency per sequenced base. The calculation was done for all mappings to our reporter plasmid. The results provide an overview about the basal mutation rate in combination with the sequencing error and the effect of the error prone polymerase I.
Figure 6: Mutation frequency determined by high-throughput sequencing. The mutation frequency was determined as observed mutation per sequenced bases. The control is our reporter plasmid grown without any kind of mutator. Therefore, this sample was used to control for sequencing errors and other background effects. The higher numbers of differences to the reference in other samples is supposed to represent true mutations. Analyzed samples are our reporter plasmid grown in E. coli JS200 with plasmid expressed wild type polymerase I or error prone polymerase I, respectively.
The calculated mutation rate for the control is 7.55×10-4 mutations per base pair, which is higher than the expected sequencing error of 1.6×10-4 (Illumina 2014). The mutation rate of the wild type polymerase I is 7.3×10-4 and therefore very similar to the control. Use of EP Pol I increases the mutation rate to 10.3×10-4 mutations per base pair. Under the assumption that the difference between the wild type polymerase I sample and the control (0.25×10-4) is the measurement inaccuracy, the error prone sample lies over 10 times this inaccuracy value above the control.
However, it is also possible, that the overexpression of proteins participating in DNA repair like wild type polymerase I leads to a decreased mutation rate.
Furthermore, we used another script than looks for mutations by scanning the generated TSV files for bases, which deviate from the consensus base. All this mutations are summed up, deviating which kind of mutation (reference base &arr; mutated base) is present. This results in an extensive overview of the mutations introduced by the error-prone polymerase but also the sequencing errors.
Figure 7: Observed mutation distribution of the different sequencing libraries. All mutations (n) inside the sequencing data were classified as the appropriate mutation type. The squares show the percentage of mutations from the reference base to a mutated base.
We observed a significant change in mutational spectrum between the error prone polymerase I and the control (p = 1.43×10-6, χ2 = 44.89) but no difference between the wild type polymerase I and the control (χ2 = 3.86). Especially the strong increase of the transitions G→A and C→T is notable. But also the general portion of transition of the complete mutations increases from 28.9 % and 32.7 % in the control respectively in the wild type polymerase I to 43.4 % in the error prone polymerase I strain.
Figure 8: Proportion of transitions and transversions observed from all mutations on our reporter plasmid causedby the error prone, wild type polymerase in E. coli JS200 or our reporter plasmid in E. coli Top10, respectively.
The mutational spectrum of the error prone polymerase I was investigated by Troll and coworkers (Troll et al. 2014) using a total of 123 sanger sequenced clones.
Figure 9: Mutagenic spectrum of the error prone polymerase I as determined by Troll and coworkers (Troll et al. 2014).
In comparison to the published results our experiments revealed a different picture of the mutagenic capabilities of the error prone polymerase I. We found a prevalence for transitions, while Troll found 83 % transversions.
The difference between both results can probably be explained by the different methods applied. This assumption is supported by vastly different mutation spectra previously observed by comparing different methods (Badran and Liu, 2015). When investigation the rpoB gene of forward mutants in a rifampicin assay with sanger sequencing and a small sample size (n < 215) the mutation spectrum differs very strongly from the one determined with NGS and a much larger sample size (n < 112580). Generally, NGS is a more reliable method due to the high amount of observed (mutation)events. Furthermore, NGS is independent from different selection strategy like selection of the occurrence of an antibiotic resistance. This preselection before sequencing is needed to filter out the very large amount (>99 %) of unmutated sequences. However, this selection introduces a mutational bias. Therefore, NGS with a high sample size and the possibility to sequence a library, which was created by mutation without any kind of selection pressure, is the optimal way to determine mutation spectra. Nevertheless, there are some biases in NGS data generation as well. The GC content of the fragment was shown to effect the mapability of a ultra short NGS read (Dohm et al., 2008). However, most of these effects do not impact the results significantly anymore due to an increased read length and improved sequencing chemistry.

Genome wide mutator

Our genome wide mutator system was analyzed via NGS as well. We used the M6 mutator and the dnaQ926 mutator gene alone. The plasmids were used from a corresponding reversion experiment.
The experimental setup was as following:
The M6 mutator BBa_K2082117 or the dnaQ926 device BBa_K2082116 were cotransformed with the reporter plasmid pLA230 into E. coli Top10 and plated on LB agar plates with chloramphenicol, kanamycin and glucose.
From a single colony LB media with chloramphenicol and kanamycin was inoculated, grown to mid-log phase, induced or repressed with arabinose or glucose and plated after 24 h. Plasmids for sequencing were isolated 24 h after induction/repression, yielding four libraries for NGS.
The mutation rate and the mutation spectrum was determined as describe above.
Figure 10: Mutation rate after induction/repression of BBa_K2082117.
The mutation rate increases from the control to the repressed M6 plasmid sample and further to the induced M6 plasmid sample. Using the difference between control and the wild type polymerase sample to measure the inaccuracy of the method. The repressed plasmid sample is 1.8 times and the induced plasmid sample is 7.1 times the measurement inaccuracy above the control. This means that BBa_K2082117 mutates more strongly than the background, thus is a mutator. Furthermore, the induction of BBa_K2082117 works as seen in an increase in mutation rate in comparison to the uninduced sample. Finally, the used promoter is active even in the not-induced state as seen by the increase in mutation rate between the control and the repressed sample. Based on these results we modified our reversion assays in that we kept the repressed control under glucose repression at all times. In contrast: in this experiment we cultivated at first in LB media and added glucose in the repressed sample simultaneous to the addition of arabinose in the induced sample.
Additionally, we wanted to determine the mutagenic spectra of our mutators. This was done as described above.
Figure 11: Mutational spectrum of the M6 and dnaQ926 mutators.
Based on the χ2 test no significant difference between the spectrum of the control and the induced mutators could be observed. (χ2dnaQ926 = 0.85, χ2M6 = 3.78)
Nevertheless, the mutation spectrum is comparable to the published mutation spectrum of the M6 plasmid used by Badran and coworkers (Badran and Liu, 2015).
Figure 12: Mutational spectrum of our M6 system BBa_K2082117 and the one used by Badran and coworkers (Badran and Liu, 2015).

de novo assembly of pLA230

pLA230 was used as control. To make sure our used sequence was correct, we assembled pLA230 de novo from the control dataset and compared it with the published plasmid sequence. We found one point mutation at position and detected an insertion behind the beta lactamase coding sequence.
Figure 13: Alignment of the de novo assembled pLA230 sequence and the pLA230 sequence from Addgene. The alignment shows a point mutation and the deletion inside the pLA230 sequence.


We analyzed the error prone polymerase I and our genome wide mutators with next generation sequencing.
We showed that the error prone polymerase I has a visible increased mutation rate (10x over background). Furthermore we determined the mutation spectrum of the error prone polymerase I and found a prevalence of transitions.
For our genome wide mutator system we found an increased mutation rate when using the M6 BioBrick BBa_K2082117 up to 7x over background. The uninduced sample also showed visible increased mutation (~2x over background), confirming our assumption of considerable leakiness of our promoter, when not consequently repressed with glucose.
We also determined the mutation spectrum of the genome wide mutator and it shows the same mutageic spectra as the MP6 plasmid used by Badran and coworkers(Badran and Liu 2015).


  • Badran, Ahmed H.; Liu, David R. (2015): Development of potent in vivo mutagenesis plasmids with broad mutational spectra. In: Nature communications 6, S. 8425. DOI: 10.1038/ncomms9425.
  • Bolger, Anthony M.; Lohse, Marc; Usadel, Bjoern (2014): Trimmomatic: a flexible trimmer for Illumina sequence data. In: Bioinformatics (Oxford, England) 30 (15), S. 2114–2120. DOI: 10.1093/bioinformatics/btu170.
  • Cabello, F.; Timmis, K.; Cohen, S. N. (1976): Replication control in a composite plasmid constructed by in vitro linkage of two distinct replicons. In: Nature 259 (5541), S. 285–290.
  • Camps, Manel; Naukkarinen, Jussi; Johnson, Ben P.; Loeb, Lawrence A. (2003): Targeted gene evolution in Escherichia coli using a highly error-prone DNA polymerase I. In: Proceedings of the National Academy of Sciences of the United States of America 100 (17), S. 9727–9732. DOI: 10.1073/pnas.1333928100.
  • Hasunuma, K.; Sekiguchi, M. (1977): Replication of plasmid pSC101 in Escherichia coli K12: requirement for dnaA function. In: Molecular & general genetics : MGG 154 (3), S. 225–230.
  • Illumina (2014): Understanding Illumina Quality Scores. Quality scores are an efficient way to communicate small error probabilities. Online verfügbar unter
  • Kornberg, A.; Baker, T. A. (1992): DNA replication. In: Freeman, New York.
  • Lee, Heewook; Popodi, Ellen; Tang, Haixu; Foster, Patricia L. (2012): Rate and molecular spectrum of spontaneous mutations in the bacterium Escherichia coli as determined by whole-genome sequencing. In: Proceedings of the National Academy of Sciences of the United States of America 109 (41), S. E2774-83. DOI: 10.1073/pnas.1210309109.
  • Troll, Christopher; Yoder, Jordan; Alexander, David; Hernandez, Jaime; Loh, Yueling; Camps, Manel (2014): The mutagenic footprint of low-fidelity Pol I ColE1 plasmid replication in E. coli reveals an extensive interplay between Pol I and Pol III. In: Current genetics 60 (3), S. 123–134. DOI: 10.1007/s00294-013-0415-9.
  • Xue, Yali; Wang, Qiuju; Long, Quan; Ng, Bee Ling; Swerdlow, Harold; Burton, John et al. (2009): Human Y chromosome base-substitution mutation rate measured by direct sequencing in a deep-rooting pedigree. In: Current biology : CB 19 (17), S. 1453–1457. DOI: 10.1016/j.cub.2009.07.032.
  • Zhu, Yuan O.; Siegal, Mark L.; Hall, David W.; Petrov, Dmitri A. (2014): Precise estimates of mutation rate and spectrum in yeast. In: Proceedings of the National Academy of Sciences of the United States of America 111 (22), S. E2310-8. DOI: 10.1073/pnas.1323011111.