Compositional Data Analysis is necessary for simulating and analyzing RNA-Seq data

Creators: McGee, Warren A.; Pimentel, Harold; Pachter, Lior; Wu, Jane Y.

Abstract

*Seq techniques (e.g. RNA-Seq) generate compositional datasets, i.e. the number of fragments sequenced is not proportional to the total RNA present. Thus, datasets carry only relative information, even though absolute RNA copy numbers are often of interest. Current normalization methods assume most features are not changing, which can lead to misleading conclusions when there are large shifts. However, there are few real datasets and no simulation protocols currently available that can directly benchmark methods when such large shifts occur. We present absSimSeq, an R package that simulates compositional data in the form of RNA-Seq reads. We tested several tools used for RNA-Seq differential analysis: sleuth, DESeq2, edgeR, limma, and ALDEx2 (which explicitly takes a compositional approach). For these tools, we compared their standard normalization to either "compositional normalization", which uses log-ratios to anchor the data on a set of negative control features, or RUVSeq, another tool that directly uses negative control features. We show that common normalizations result in reduced performance with current methods when there is a large change in the total RNA per cell. Performance improves when spike-ins are included and used by a compositional approach, even if the spike-ins have substantial variation. In contrast, RUVSeq, which normalizes count data rather than compositional data, has poor performance. Further, we show that previous criticisms of spike-ins did not take into account the compositional nature of the data. We conclude that absSimSeq can generate more representative datasets for testing performance, and that spike-ins should be more broadly used in a compositional manner to minimize misleading conclusions from differential analyses.
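The idea of anchoring log-ratios on a negative control feature can be illustrated with a small toy calculation. The sketch below is not the absSimSeq or sleuth-ALR code. The function names (`close_to_depth`, `alr`), the gene names, and the copy numbers are all hypothetical, and a single noiseless spike-in is assumed as the reference. It uses only the Python standard library to show how a fixed sequencing depth hides a global shift in total RNA, and how an additive log-ratio (ALR) against the spike-in recovers the absolute change in this idealized setting.

```python
import math

def close_to_depth(absolute, depth=1_000_000):
    """Rescale absolute copy numbers to a fixed total, mimicking a fixed sequencing depth."""
    total = sum(absolute.values())
    return {name: depth * n / total for name, n in absolute.items()}

def alr(composition, reference):
    """Additive log-ratio: log of each feature relative to the chosen reference feature."""
    ref = composition[reference]
    return {name: math.log(x / ref)
            for name, x in composition.items() if name != reference}

# Hypothetical absolute copy numbers per cell (illustrative values only).
# "spike" stands in for a negative control added in equal amount to both samples.
control = {"geneA": 100.0, "geneB": 100.0, "geneC": 100.0, "spike": 10.0}
treated = {"geneA": 100.0, "geneB": 400.0, "geneC": 400.0, "spike": 10.0}

# What the sequencer reports: compositions with the same total in both samples.
control_obs = close_to_depth(control)
treated_obs = close_to_depth(treated)

genes = ["geneA", "geneB", "geneC"]

# True log fold changes, computed from the (normally unobserved) absolute counts.
true_lfc = {g: math.log(treated[g] / control[g]) for g in genes}

# Naive comparison of depth-scaled values, ignoring the shift in total RNA.
naive_lfc = {g: math.log(treated_obs[g] / control_obs[g]) for g in genes}

# Compositional comparison: difference of ALR values anchored on the spike-in.
alr_control = alr(control_obs, "spike")
alr_treated = alr(treated_obs, "spike")
alr_lfc = {g: alr_treated[g] - alr_control[g] for g in genes}

for g in genes:
    print(f"{g}: true={true_lfc[g]:+.3f} naive={naive_lfc[g]:+.3f} alr={alr_lfc[g]:+.3f}")
```

In this toy example geneA has the same copy number in both samples, yet the naive comparison reports it as decreased because the other genes take up a larger share of the fixed depth. The spike-in-anchored log-ratios match the true changes. Real spike-ins are noisy, so this exact recovery is a property of the idealized assumptions rather than of real data.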

Additional Information

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

We are grateful to Rosemary Braun and David Kuo for helpful suggestions and critical reading of the manuscript. WAM and JYW are supported by the NIH (F30 NS090893 to WAM; R01CA175360 and R01NS107396 to JYW). HP is supported by the Howard Hughes Medical Institute Hanna Gray Fellowship.

Authors' Contributions: WAM conceived the idea, designed the approach, and wrote the software for sleuth-ALR and absSimSeq. WAM and HP wrote the code for the analysis pipeline. JYW and LP provided supervision. WAM and JYW wrote the manuscript.

Availability of data and code: The yeast starvation dataset was taken from Marguerat et al. [26] from ArrayExpress at accession E-MTAB-1154, and the absolute counts were taken from Supplementary Table S2 of [26]. The GEUVADIS Finnish data can be found at ArrayExpress using accession E-GEUV-1, using the samples with the population code "FIN" and sex "female". The Bottomly et al. data [35] can be found on the Sequence Read Archive (SRA) using the accession SRP004777. Human annotations were taken from Gencode v. 25 and Ensembl v. 87, mouse annotations were taken from Gencode v. M12 and Ensembl v. 87, and yeast annotations were taken from Ensembl Genomes Fungi release 37. The code and vignette for absSimSeq can be found on GitHub at www.github.com/warrenmcg/absSimSeq, the code and vignette for using sleuth-ALR can be found at www.github.com/warrenmcg/sleuth-ALR, and the full code to reproduce the analyses in this paper can be found at www.github.com/warrenmcg/sleuthALR_paper_analysis.

Software versions used: kallisto v. 0.44.0, limma v. 3.34.9, edgeR v. 3.20.9, RUVSeq v. 1.12.0, and DESeq2 v. 1.18.1. The version of polyester used is a forked branch that modified version 1.14.1 with significant speed improvements (found here: www.github.com/warrenmcg/polyester). The version of sleuth used is a forked branch that modified version 0.29.0 with speed improvements and modifications to allow for sleuth-ALR (found here: www.github.com/warrenmcg/sleuth/tree/speedy_fit). The version of ALDEx2 used is a forked branch that modified version 1.10.0 to make some speed improvements and to fix a bug that prevented getting effects if the ALR transformation with one feature was used (found here: www.github.com/warrenmcg/ALDEx2). All R code was run using R version 3.4.4, and the full pipeline was run using snakemake.

The authors declare no competing financial interests.

Attached Files

Submitted - 564955.full.pdf

Files (1.4 MB)

Name	Size
564955.full.pdf (md5:2b8ae032a9fa86bce1be1c289faaacc6)	1.4 MB
