Published June 2022 | Supplemental Material + Published
Journal Article Open

Minimal gene set discovery in single-cell mRNA-seq datasets with ActiveSVM

Abstract

Sequencing costs currently prohibit the application of single-cell mRNA-seq to many biological and clinical analyses. Targeted single-cell mRNA-sequencing reduces sequencing costs by profiling reduced gene sets that capture biological information with a minimal number of genes. Here we introduce an active learning method that identifies minimal but highly informative gene sets that enable the identification of cell types, physiological states and genetic perturbations in single-cell data using a small number of genes. Our active feature selection procedure generates minimal gene sets from single-cell data by employing an active support vector machine (ActiveSVM) classifier. We demonstrate that ActiveSVM feature selection identifies gene sets that enable ~90% cell-type classification accuracy across, for example, cell atlas and disease-characterization datasets. The discovery of small but highly informative gene sets should enable reductions in the number of measurements necessary for application of single-cell mRNA-seq to clinical tests, therapeutic discovery and genetic screens.
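The idea of growing a gene set one gene at a time, looking only at cells the current classifier handles poorly, can be illustrated with a small sketch. This is not the authors' implementation and does not use the activeSVC package API. It is a pure-Python toy under stated assumptions: two classes labelled +1 and -1, a linear SVM fitted by batch subgradient descent, and a scoring rule (the magnitude of the hinge-loss gradient with respect to a candidate gene's weight, summed over margin-violating cells) chosen here for illustration. The function names and the toy matrix are hypothetical.

```python
def train_linear_svm(X, y, genes, epochs=200, lr=0.1, lam=0.01):
    """Fit a linear SVM on the chosen gene columns by batch subgradient descent."""
    w = [0.0] * len(genes)
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wk * xi[g] for wk, g in zip(w, genes)) + b
            if yi * score < 1.0:  # hinge loss is active for this cell
                for k, g in enumerate(genes):
                    grad_w[k] -= yi * xi[g] / n
                grad_b -= yi / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def margins(X, y, genes, w, b):
    """Signed margin y * f(x) of every cell under the current classifier."""
    return [yi * (sum(wk * xi[g] for wk, g in zip(w, genes)) + b)
            for xi, yi in zip(X, y)]


def select_genes(X, y, n_genes):
    """Greedily add genes, scoring candidates only on margin-violating cells."""
    selected, w, b = [], [], 0.0
    for _ in range(n_genes):
        m = margins(X, y, selected, w, b)
        hard = [i for i, mi in enumerate(m) if mi < 1.0]
        candidates = [g for g in range(len(X[0])) if g not in selected]
        if not hard or not candidates:
            break  # every cell is classified with margin, or no genes remain

        def score(g):
            # Assumed rule: size of the hinge-loss gradient for a new weight on gene g.
            return abs(sum(y[i] * X[i][g] for i in hard))

        selected.append(max(candidates, key=score))
        w, b = train_linear_svm(X, y, selected)
    return selected


# Toy standardized expression matrix: rows are cells, columns are genes.
# Gene 0 separates most cells, gene 2 separates the two cells gene 0 cannot,
# and genes 1 and 3 are low-amplitude noise.
X = [
    [1.0, 0.1, 0.0, -0.1],
    [1.0, -0.1, 0.0, 0.1],
    [1.0, 0.1, 0.0, 0.1],
    [0.0, -0.1, 2.0, -0.1],
    [-1.0, 0.1, 0.0, -0.1],
    [-1.0, -0.1, 0.0, 0.1],
    [-1.0, 0.1, 0.0, 0.1],
    [0.0, -0.1, -2.0, 0.1],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

selected = select_genes(X, y, n_genes=2)
print(selected)  # -> [0, 2]
```

On real single-cell data one would likely replace the hand-written solver with a library SVM and handle more than two cell types; the sketch only shows the select, retrain, and re-examine loop.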

Additional Information

© The Author(s) 2022, corrected publication 2022. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Received 23 July 2021. Accepted 17 May 2022. Published 27 June 2022.

We would like to thank I.-M. Strazhnik for expert assistance with preparation of illustrations and G. Riddihough of Life Science Editors for editorial assistance. We thank J. Jiang, Y. Yue, L. Cai, D. Sivak, D. Angeles and K. Zinn for discussion. The work was supported by the Heritage Medical Research Institute, the Beckman Institute Single-cell Profiling and Engineering Center (SPEC), NIH (R01HD100039), and the Margaret E. Early Medical Research Trust.

Contributions. X.C. conceived the ActiveSVM algorithm. X.C. and M.T. refined the algorithm and developed the application to single-cell genomics. X.C., S.C. and M.T. performed numerical experiments, biological interpretation, and data analysis. S.C. analyzed the Tabula Muris and multiple myeloma datasets and established biological interpretation of ActiveSVM results. X.C., S.C. and M.T. wrote the paper.

Data availability. All of the data used in the paper have been previously published. The PBMC single-cell RNA-seq data have been deposited in the Short Read Archive under accession no. SRP073767 by the authors of ref. 13. Data are also available at http://support.10xgenomics.com/single-cell/datasets. The original Tabula Muris dataset is available at https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733. The original multiple myeloma PBMC data, which contain two healthy donors and four multiple myeloma donors, are available at https://figshare.com/articles/dataset/PopAlign_Data/11837097/3. The 10x Genomics Megacell dataset is available at http://support.10xgenomics.com/single-cell/datasets. The perturb-seq dataset (ref. 17) is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2396856. The spatial transcriptomics data (ref. 18) are available at https://github.com/CaiGroup/seqFISH-PLUS. Source Data are provided with this paper.

Code availability. Our method is integrated as an installable Python package called ActiveSVC. The installation instructions and user guidance are shown at https://pypi.org/project/activeSVC. The source code of activeSVC and some demo examples are publicly available on GitHub at https://github.com/xqchen/activeSVC and Zenodo (ref. 56). We also created a Google Colaboratory project demonstrating three examples: the PBMC demo is at https://colab.research.google.com/drive/16h8hsnJ3ukTWAPnCB581dwj-nN5oopyM?usp=sharing, the Tabula Muris demo is at https://colab.research.google.com/drive/1SLehIKIQqpjK6BzEKc9m0y3uJ_LBqRzA?usp=sharing, and the PBMC cross-validation (ref. 57) demo is at https://colab.research.google.com/drive/1fhQ8GD3NyzB3w0vof9WimXK6BLqDNuDC?usp=sharing.

The authors declare no competing interests.

Peer review. Nature Computational Science thanks the anonymous reviewers for their contribution to the peer review of this work. Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Errata

04 July 2022: In the version of this article initially published, edited Source Data captions for Fig. 6 and Extended Data Fig. 1 mistakenly referred to "Extended Data" figures rather than folders with "ED" prefixes in the data. The captions have been corrected in the HTML version of the article.

Attached Files

Published - s43588-022-00263-8.pdf

Supplemental Material - 43588_2022_263_Fig7_ESM.jpg

Supplemental Material - 43588_2022_263_MOESM1_ESM.pdf

Supplemental Material - 43588_2022_263_MOESM2_ESM.zip

Supplemental Material - 43588_2022_263_MOESM3_ESM.zip

Supplemental Material - 43588_2022_263_MOESM4_ESM.zip

Supplemental Material - 43588_2022_263_MOESM5_ESM.zip

Supplemental Material - 43588_2022_263_MOESM6_ESM.zip

Supplemental Material - 43588_2022_263_MOESM7_ESM.zip

Files

43588_2022_263_MOESM5_ESM.zip
Files (153.2 MB)
Checksum                                Size
md5:6f1d6d45ce2fd2308889cb66d5033145    8.4 MB
md5:f47c7e1b34271f44081fe5e4cd0bcdb0    1.3 MB
md5:3d4f203415a9f7a9620130c8e44970dd    71.3 kB
md5:ffd776bbbe48bece6473b3d722f4aeb2    109.0 MB
md5:17ee2f5a295f718ec70a27b2e2e94e06    4.0 MB
md5:f14eb263a2718acf5b44c7c4fc3ea08c    6.8 MB
md5:278d129e6a180a01ce96c9429e5a9584    711.0 kB
md5:23cb26a08de485d06d9500240bc632c8    9.5 MB
md5:fab5754329dba6f6301a43115b2bbd88    13.2 MB
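
The md5 values listed above can be used to check that a downloaded copy of a file is intact. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and because this record does not state which checksum belongs to which file name, the pairing has to be taken from the repository page itself.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, listed):
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return file_md5(path) == expected
```

For example, `matches_record("local_copy.zip", "md5:ffd776bbbe48bece6473b3d722f4aeb2")` would return True only if the hypothetical local file `local_copy.zip` is byte-identical to the 109.0 MB file in the record.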

Additional details

Created: August 22, 2023
Modified: December 22, 2023