Published August 31, 2021 | Supplemental Material + Submitted
Report Open

The Specious Art of Single-Cell Genomics

Abstract

Dimensionality reduction is standard practice for filtering noise and identifying relevant dimensions in large-scale data analyses. In biology, single-cell expression studies almost always begin with reduction to two or three dimensions to produce 'all-in-one' visuals of the data that are amenable to the human eye, and these are subsequently used for qualitative and quantitative analysis of cell relationships. However, there is little theoretical support for this practice. We examine the theoretical and practical implications of low-dimensional embedding of single-cell data, and find extensive distortions of the global and local properties of biological patterns relative to the high-dimensional, ambient space. As an alternative, we propose semi-supervised dimension reduction to higher dimensions, and show that such targeted reduction, guided by the metadata associated with single-cell experiments, provides useful latent space representations for hypothesis-driven biological discovery.
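As a rough illustration of the kind of distortion the abstract refers to, the sketch below compares pairwise distances before and after a two-dimensional embedding. It is not taken from the paper: it assumes synthetic Gaussian data in place of a single-cell count matrix, uses plain PCA as a stand-in for whatever embedding method a study might apply, and all function names (`pairwise_distances`, `pca_embed`, `coefficient_of_variation`) are hypothetical helpers written for this example.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distances between all pairs of rows, as a flat vector."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    iu = np.triu_indices(X.shape[0], k=1)    # count each unordered pair once
    return np.sqrt(np.maximum(d2[iu], 0.0))  # clip tiny negatives from round-off

def pca_embed(X, k):
    """Project centered data onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def coefficient_of_variation(d):
    """Spread of the distances relative to their mean."""
    return d.std() / d.mean()

# Assumption: 200 synthetic "cells" with 50 features, drawn i.i.d. Gaussian.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
Y = pca_embed(X, 2)

d_ambient = pairwise_distances(X)
d_embedded = pairwise_distances(Y)

cv_ambient = coefficient_of_variation(d_ambient)
cv_embedded = coefficient_of_variation(d_embedded)
pearson_r = np.corrcoef(d_ambient, d_embedded)[0, 1]

print(f"relative spread of distances, ambient space: {cv_ambient:.3f}")
print(f"relative spread of distances, 2-D embedding: {cv_embedded:.3f}")
print(f"Pearson correlation of ambient vs. 2-D distances: {pearson_r:.3f}")
```

In this toy setting the points are nearly equidistant in the ambient space, while the two-dimensional embedding spreads their distances over a much wider relative range, so the embedded distances track the ambient ones only loosely. Real single-cell data and nonlinear embeddings will behave differently in detail; the sketch only shows how such a comparison can be set up.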

Additional Information

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

Version 1: August 26, 2021; Version 2: September 21, 2021; Version 3: September 27, 2021.

Some of the computations presented here were conducted using machines in the Resnick High Performance Center, a facility supported by the Resnick Sustainability Institute at the California Institute of Technology. We thank Gennady Gorin and Benjamin Riviere for helpful discussions regarding the MCML and Picasso analyses, Sina Booeshaghi for helpful discussions regarding NCA and dimensionality reduction, Ingileif Hallgrimsdottir for valuable feedback on the manuscript, and Pall Melsted for useful insights regarding Theorem 1. The work was supported in part by NIH grant U19MH114830, and Joeyta Banerjee was supported in part by the Caltech Summer Undergraduate Research Fellowship (SURF).

Data Availability: Download links for the original data used to generate the figures and results in the paper are listed in Table 1. Processed and normalized versions of the count matrices are available on CaltechData, with links provided in Supplementary Table 1.

Code Availability: All analysis code used to generate the figures and results in the paper is available at https://github.com/pachterlab/CBP_2021, with Picasso and MCML analyses provided in notebooks which can be run on Google Colab. Picasso is also available at https://github.com/pachterlab/picasso. The MCML method, as well as tools for quantitative analysis, are available via a Python pip installable package from https://github.com/pachterlab/MCML.

Author Contributions: Conceived of the project: TC and LP. Wrote scripts for processing the data and code for the analysis: TC and JB. Developed the Google Colab notebooks: TC and JB. Analyzed and interpreted the data: TC and LP. Writing and editing the manuscript: TC and LP.

The authors declare no competing interests.

Attached Files

Submitted - 2021.08.25.457696v3.full.pdf

Supplemental Material - media-1.pdf

Additional details

Created: August 20, 2023
Modified: December 22, 2023