Published July 2021 | Supplemental Material
Journal Article
Open
Modular, efficient and constant-memory single-cell RNA-seq preprocessing
Abstract
We describe a workflow for preprocessing single-cell RNA-sequencing data that balances efficiency and accuracy. Our workflow is based on the kallisto and bustools programs, and is near-optimal in speed with a constant memory requirement, providing scalability for arbitrarily large datasets. The workflow is modular, and we demonstrate its flexibility by showing how it can be used for RNA velocity analyses.
Additional Information
© 2021 Nature Publishing Group. Received 07 August 2019; Accepted 09 February 2021; Published 01 April 2021.
We thank V. Ntranos and V. Svensson for helpful suggestions and comments. We thank J. Farrell for the D. rerio gene annotation used to process SRR6956073, J. Schiefelbein for the A. thaliana gene annotation used to process SRR8257100, J. Fear for the D. melanogaster gene annotation used to process SRR8513910, and J. Kim and Q. Zhu for the C. elegans gene annotation used to process SRR8611943. The benchmarking work was made possible, in part, thanks to support from the Beckman Institute Caltech Bioinformatics Resource Center. A.S.B. and L.P. were funded in part by NIH U19MH114830.
Data availability: A diverse set of 20 datasets was compiled for the purpose of benchmarking preprocessing workflows. Datasets produced and distributed by 10x Genomics were downloaded from the 10x Genomics data downloads page: https://support.10xgenomics.com/single-cell-gene-expression/datasets. Six v3 chemistry datasets and two v2 chemistry datasets were downloaded and processed (Supplementary Table 3). Another 12 datasets were obtained from either the SRA or the European Nucleotide Archive; all were produced with 10x Genomics v2 chemistry. For six of the datasets (SRR6956073, SRR6998058, SRR7299563, SRR8206317, SRR8327928 and SRR8524760), the BAM files were downloaded and the Cell Ranger utility bamtofastq was run to produce FASTQ files for preprocessing from Cell Ranger–structured BAM files. FASTQ files were downloaded directly for the datasets E-MTAB-7320, SRR8257100, SRR8513910, SRR8599150 (available at https://github.com/bustools/getting_started/releases/download/getting_started/SRR8599150_S1_L001_R1_001.fastq.gz and https://github.com/bustools/getting_started/releases/download/getting_started/SRR8599150_S1_L001_R2_001.fastq.gz), SRR8611943 and SRR8639063.
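The direct FASTQ download described above for SRR8599150 could be scripted roughly as follows. This is a minimal sketch, not the authors' procedure: the two URLs are copied from the data availability statement, while the helper names `local_name` and `fetch_fastqs` and the default `fastq` output directory are hypothetical choices made for illustration.

```python
from pathlib import Path
from typing import List
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URLs copied verbatim from the data availability statement.
FASTQ_URLS = [
    "https://github.com/bustools/getting_started/releases/download/getting_started/SRR8599150_S1_L001_R1_001.fastq.gz",
    "https://github.com/bustools/getting_started/releases/download/getting_started/SRR8599150_S1_L001_R2_001.fastq.gz",
]


def local_name(url: str) -> str:
    """Return the file-name component of a URL (hypothetical helper)."""
    return Path(urlparse(url).path).name


def fetch_fastqs(dest_dir: str = "fastq") -> List[Path]:
    """Download each FASTQ file into dest_dir, skipping files already present.

    The directory name is an assumption; nothing is downloaded until this
    function is called explicitly.
    """
    out = Path(dest_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in FASTQ_URLS:
        target = out / local_name(url)
        if not target.exists():
            urlretrieve(url, str(target))
        paths.append(target)
    return paths
```

Calling `fetch_fastqs()` would place the R1 and R2 files side by side in one directory; the skip-if-present check is only a convenience so that a rerun does not fetch the files again.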
Code availability: The software versions used for the results in the paper were: Alevin v0.13.1, bustools v0.39.1, Cell Ranger v3.0.0, DropletUtils v1.6.1, kallisto v0.46.0, Python 3.7, R v3.5.2, Scanpy v1.4.1, scvelo 0.1.17, Seurat v3.0, snakemake v5.3.0, STARsolo v2.7.0e, velocyto v0.17.17, wc v8.22 (GNU coreutils) and zcat v1.5 (gzip). All programs were run with default options unless otherwise specified. The code to reproduce the findings of this paper is available at https://github.com/pachterlab/MBLGLMBHGP_2021/, kallisto is available at https://github.com/pachterlab/kallisto/ and bustools is available at https://github.com/BUStools/bustools/. Documentation and tutorials for using the kallisto bustools scRNA-seq workflow are available at http://pachterlab.github.io/kallistobustools. Details of all datasets and their accession numbers can be found in Supplementary Table 3. All genome annotations and reference transcriptomes can be found at https://doi.org/10.22002/D1.1876.
These authors contributed equally: Páll Melsted, A. Sina Booeshaghi.
Author Contributions: P.M., A.S.B., L. Liu and L.P. developed the algorithms for bustools and P.M., A.S.B. and L. Liu wrote the software. A.S.B. conceived of and performed the UMI and barcode calculations motivating the algorithms. F.G. implemented and performed the benchmarking procedure, and curated indices for the datasets. A.S.B. and E.d.V.B. designed and produced the comparisons between Cell Ranger and kallisto bustools. L. Lu investigated in detail the performance of different workflows on the "10k mouse neuron" data and produced the analysis of that dataset. A.S.B. designed the RNA velocity workflow and performed the RNA velocity analyses. K.M.H. contributed to the development of the reproducible workflow. K.E.H. developed and investigated the effect of reference transcriptome sequences for pseudoalignment. J.G. interpreted results and helped to supervise the research. A.S.B. planned, organized and prepared figures. A.S.B., E.d.V.B., P.M. and L.P. planned the manuscript. A.S.B. and L.P. wrote the manuscript.
The authors declare no competing interests.
Peer review information: Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
Attached Files
Supplemental Material - 41587_2021_870_MOESM1_ESM.pdf
Supplemental Material - 41587_2021_870_MOESM2_ESM.pdf
Supplemental Material - 41587_2021_870_MOESM3_ESM.xlsx
Supplemental Material - 41587_2021_870_MOESM4_ESM.xlsx
Additional details
- Eprint ID
- 108622
- Resolver ID
- CaltechAUTHORS:20210405-142728694
- Caltech Beckman Institute
- NIH
- U19MH114830
- Created
- 2021-04-07 (created from EPrint's datestamp field)
- Updated
- 2021-07-13 (created from EPrint's last_modified field)
- Caltech groups
- Division of Biology and Biological Engineering (BBE)