Barcode identification for single cell genomics

Creators: Tambe, Akshay; Pachter, Lior

Abstract

Background: Single-cell sequencing experiments use short DNA barcode 'tags' to identify reads that originate from the same cell. In order to recover single-cell information from such experiments, reads must be grouped based on their barcode tag, a crucial processing step that precedes other computations. However, this step can be difficult due to high rates of mismatch and deletion errors that can afflict barcodes. Results: Here we present an approach to identify and error-correct barcodes by traversing the de Bruijn graph of circularized barcode k-mers. Our approach is based on the observation that circularizing a barcode sequence can yield error-free k-mers even when the size of k is large relative to the length of the barcode sequence, a regime which is typical of single-cell barcoding applications. This allows for assignment of reads to consensus fingerprints constructed from k-mers. Conclusion: We show that for single-cell RNA-Seq, circularization improves the recovery of accurate single-cell transcriptome estimates, especially when there is a high number of errors per read. This approach is robust to the type of error (mismatch, insertion, deletion), as well as to the relative abundances of the cells. Sircel, a software package that implements this approach, is described and publicly available.
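The circularization observation in the abstract can be illustrated with a small sketch. This is not the Sircel implementation; the function names `linear_kmers` and `circular_kmers`, the 12-nt barcode, and the choice k = 8 are assumptions made for illustration only. The sketch compares how many k-mers a read with a single mismatch still shares with the true barcode, with and without circularization.

```python
def linear_kmers(barcode: str, k: int) -> list[str]:
    """Return the k-mers of a barcode read as an ordinary linear sequence."""
    return [barcode[i:i + k] for i in range(len(barcode) - k + 1)]


def circular_kmers(barcode: str, k: int) -> list[str]:
    """Return the k-mers of a barcode treated as a circular sequence.

    The first k - 1 bases are appended to the end, so that every position
    in the barcode starts exactly one k-mer.
    """
    if not 0 < k <= len(barcode):
        raise ValueError("k must be between 1 and the barcode length")
    wrapped = barcode + barcode[:k - 1]
    return [wrapped[i:i + k] for i in range(len(barcode))]


# Hypothetical 12-nt barcode and a read of it with one mismatch at index 5.
true_barcode = "ACGTTGCAAGCT"
read_barcode = "ACGTTACAAGCT"
k = 8  # large relative to the barcode length

shared_linear = set(linear_kmers(true_barcode, k)) & set(linear_kmers(read_barcode, k))
shared_circular = set(circular_kmers(true_barcode, k)) & set(circular_kmers(read_barcode, k))

print(len(shared_linear), "error-free linear k-mers")
print(len(shared_circular), "error-free circular k-mers")
```

In this example every linear 8-mer overlaps the mismatch, so none survive, while the circular 8-mers that wrap around the end of the barcode avoid the error and remain shared with the true barcode.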

Additional Information

© 2019 The Author(s). This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Received: 23 May 2017; Accepted: 7 January 2019; Published: 17 January 2019.

We thank Jase Gehring and Vasilis Ntranos for helpful comments and feedback during the development of the method.

Funding: None.

Availability of data and materials: The datasets analyzed here were obtained from previously published datasets, which are available at the NCBI Sequence Read Archive. SRA accession numbers used in this paper are SRR1873277 and SRR5250839.

Authors' contributions: AT and LP conceived of the project. AT wrote the software and analyzed data. AT and LP wrote the manuscript. All authors read and approved the final manuscript.

Ethics approval: Not applicable.

Consent for publication: Not applicable.

The authors declare that they have no competing interests.

Attached Files

Published - s12859-019-2612-0.pdf

Submitted - 136242.full.pdf

Supplemental Material - 12859_2019_2612_MOESM1_ESM.pdf

Files (11.5 MB)

Name	MD5	Size
136242.full.pdf	642331a737721ae6907ce851f66d35a7	7.2 MB
12859_2019_2612_MOESM1_ESM.pdf	c63f98a6c049f41de61c446991be1201	2.5 MB
s12859-019-2612-0.pdf	c2588a27ac9ac996b328ee35387633e8	1.7 MB
