Published November 21, 2011 | Supplemental Material + Published
Journal Article Open

Identification and correction of systematic error in high-throughput sequence data

Abstract

Background: A feature common to all DNA sequencing technologies is the presence of base-call errors in the sequenced reads. The implications of such errors are application specific, ranging from minor informatics nuisances to major problems affecting biological inferences. Recently developed "next-gen" sequencing technologies have greatly reduced the cost of sequencing, but have been shown to be more error prone than previous technologies. Both position specific (depending on the location in the read) and sequence specific (depending on the sequence in the read) errors have been identified in Illumina and Life Technology sequencing platforms. We describe a new type of systematic error that manifests as statistically unlikely accumulations of errors at specific genome (or transcriptome) locations.

Results: We characterize and describe systematic errors using overlapping paired reads from high-coverage data. We show that such errors occur in approximately 1 in 1000 base pairs, and that they are highly replicable across experiments. We identify motifs that are frequent at systematic error sites, and describe a classifier that distinguishes heterozygous sites from systematic error. Our classifier is designed to accommodate data from experiments in which the allele frequencies at heterozygous sites are not necessarily 0.5 (such as in the case of RNA-Seq), and can be used with single-end datasets.

Conclusions: Systematic errors can easily be mistaken for heterozygous sites in individuals, or for SNPs in population analyses. Systematic errors are particularly problematic in low coverage experiments, or in estimates of allele-specific expression from RNA-Seq data. Our characterization of systematic error has allowed us to develop a program, called SysCall, for identifying and correcting such errors. We conclude that correction of systematic errors is important to consider in the design and interpretation of high-throughput sequencing experiments.
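
To make the phrase "statistically unlikely accumulations of errors" concrete, here is a minimal sketch, not the SysCall method. It assumes that base-call errors at a site are independent with a fixed per-base error rate, and asks how probable it is to see at least the observed number of mismatching reads at a given depth. The function names, the default error rate of 0.01, and the threshold `alpha` are all hypothetical values chosen for illustration.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_unlikely_accumulation(
    depth: int,
    mismatches: int,
    error_rate: float = 0.01,  # assumed per-base error rate, illustrative only
    alpha: float = 1e-6,       # assumed significance threshold, illustrative only
) -> bool:
    """Flag a site whose mismatch count is improbable under independent errors."""
    return binomial_tail(depth, mismatches, error_rate) < alpha


if __name__ == "__main__":
    # 15 mismatching reads out of 100 is far more than random error predicts.
    print(is_unlikely_accumulation(depth=100, mismatches=15))
    # 2 mismatching reads out of 100 is consistent with random error.
    print(is_unlikely_accumulation(depth=100, mismatches=2))
```

A test like this only flags sites where mismatches exceed what independent random errors would explain. It cannot by itself separate a systematic error from a true heterozygous site, since both produce an excess of mismatches; that distinction is the job of the classifier described in the abstract.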

Additional Information

© 2011 Meacham et al. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Received: 25 May 2011. Accepted: 21 November 2011. Published: 21 November 2011.

We thank Professor Yun Song and Dr. Wei-Chun Kao from UC Berkeley for the phiX174 dataset and the associated naiveBayesCall output. Dario Boffelli was partially funded by NIH grant HL084474, David Martin by NIH grant ES016581, and Meromit Singer and Lior Pachter by NIH grant 1R01HG006129-01.

Authors' contributions: FM, MS and LP formulated the problem of searching for systematic errors by studying discordant read pairs and designed a research plan. FM and MS conducted the research. DB, JD and DM performed the sequencing and contributed the datasets analyzed, and FM, MS and LP wrote the manuscript. All authors read and approved the final manuscript.

Attached Files

Published - art_3A10.1186_2F1471-2105-12-451.pdf

Supplemental Material - 12859_2011_5050_MOESM1_ESM.pdf

Supplemental Material - 12859_2011_5050_MOESM2_ESM.pdf

Supplemental Material - 12859_2011_5050_MOESM3_ESM.pdf

Supplemental Material - 12859_2011_5050_MOESM4_ESM.pdf

Supplemental Material - 12859_2011_5050_MOESM5_ESM.pdf

Supplemental Material - 12859_2011_5050_MOESM6_ESM.pdf

Supplemental Material - 12859_2011_5050_MOESM7_ESM.png

Supplemental Material - 12859_2011_5050_MOESM8_ESM.pdf

Additional details

Created: August 19, 2023
Modified: October 24, 2023