SVFX: a machine learning framework to quantify the pathogenicity of structural variants
Abstract
There is a lack of approaches for identifying pathogenic genomic structural variants (SVs) although they play a crucial role in many diseases. We present a mechanism-agnostic machine learning-based workflow, called SVFX, to assign pathogenicity scores to somatic and germline SVs. In particular, we generate somatic and germline training models, which include genomic, epigenomic, and conservation-based features, for SV call sets in diseased and healthy individuals. We then apply SVFX to SVs in cancer and other diseases; SVFX achieves high accuracy in identifying pathogenic SVs. Predicted pathogenic SVs in cancer cohorts are enriched among known cancer genes and many cancer-related pathways.
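The setup described in the abstract (features computed for SVs from diseased and healthy cohorts, then a trained model that scores new SVs) can be sketched as follows. This is a minimal illustration, not the SVFX implementation. The three feature names, the synthetic cohorts, and the choice of a logistic-regression classifier are all assumptions made to keep the example self-contained.

```python
import math
import random

# Hypothetical feature names for illustration; not taken from SVFX.
FEATURES = ["conservation", "gene_overlap", "epigenomic_signal"]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_cohort(n, mean, rng):
    """Draw n synthetic SV feature vectors centred on `mean` (assumed data)."""
    return [[rng.gauss(mean, 0.5) for _ in FEATURES] for _ in range(n)]


def train(X, y, lr=0.5, epochs=300):
    """Fit logistic regression with batch gradient descent."""
    w = [0.0] * len(FEATURES)
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def pathogenicity_score(sv_features, w, b):
    """Score in [0, 1]; higher means more similar to disease-cohort SVs."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, sv_features)))


rng = random.Random(0)
disease = make_cohort(200, 1.0, rng)   # label 1: SVs from diseased individuals
healthy = make_cohort(200, 0.0, rng)   # label 0: SVs from healthy individuals
X = disease + healthy
y = [1] * len(disease) + [0] * len(healthy)

w, b = train(X, y)
high = pathogenicity_score([1.0, 1.0, 1.0], w, b)
low = pathogenicity_score([0.0, 0.0, 0.0], w, b)
```

Here `high` scores an SV whose features resemble the synthetic disease cohort and `low` scores one resembling the healthy cohort. In a real analysis the features would be derived from genomic, epigenomic, and conservation annotations rather than sampled at random.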
Additional Information
© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Received 04 September 2019; Accepted 12 October 2020; Published 09 November 2020.
We are thankful to the members of the PCAWG SV working group for generating the variant calls. We are also grateful to the Center for Common Disease and the Genome Sequencing Program consortium members for creating SV calls for the CVD and IBD cohort used in this study. In particular, the Mount Sinai BioMe Biobank has been supported by The Andrea and Charles Bronfman Philanthropies and in part by Federal funds from the NHLBI and NHGRI (U01HG00638001; U01HG007417; X01HL134588). We thank all participants in the Mount Sinai Biobank. We also thank all our recruiters who have assisted and continue to assist in data collection and management and are grateful for the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai.
Similarly, IBD cohort data was generated as part of the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) IBD Genetics Consortium (IBDGC) and International IBD Genetics Consortium (IIBDGC) supported by The Helmsley Charitable Trust and the Centers for Common Disease Genomes Program (NHGRI). DNA samples were obtained from the following collections: The Lunenfeld-Tanenbaum Research Institute Mount Sinai Hospital (PI: Mark Silverberg), The University of Pittsburgh School of Medicine (PI: Richard Duerr), The Emory University School of Medicine (PI: Subra Kugathasan), The Johns Hopkins Hospital (PI: Steven Brant), The Icahn School of Medicine at Mount Sinai (PI: Judy Cho), The Washington University School of Medicine (PI: Rodney Newberry), The University of Miami Miller School of Medicine (PI: Maria Abreu, Jake McCauley), and Cedars Sinai (PI: Dermot McGovern, Stephan Targan).
Peer review information: Andrew Cosgrove was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The review history is available as Additional file 3.
This work was supported by National Institutes of Health grant U24HG007497 and the AL Williams Professorship funds.
Author information: Sushant Kumar and Arif Harmanci contributed equally to this work.
Author Contributions: Conceptualization: MG, SK, and AH; methodology: SK and AH; investigation: SK, AH, and JV; writing (original draft): SK and MG; writing (review and editing): SK, AH, JV, and MG; supervision: SK and MG. All authors have read and approved the final manuscript.
Ethics approval and consent to participate: Not applicable.
The authors declare that they have no competing interests.
Attached Files
Published - Kumar2020_Article_SVFXAMachineLearningFrameworkT.pdf
Submitted - 739474.full.pdf
Supplemental Material - 13059_2020_2178_MOESM1_ESM.xlsx
Supplemental Material - 13059_2020_2178_MOESM2_ESM.pdf
Supplemental Material - 13059_2020_2178_MOESM3_ESM.docx
Files
Checksum | Size |
---|---|
md5:ad2a53b578e9fb5808b0e1bb21a93ba8 | 5.9 MB |
md5:1c51d601d4e44a1123a5f52b7c6ca29e | 982.7 kB |
md5:d38e90cef274f7d333ffa14b201b987b | 43.8 kB |
md5:47859201a226733fca22781b816be64e | 6.2 MB |
md5:936528a6bf9e1418f8698a471f810cc1 | 2.3 MB |
Additional details
- PMCID: PMC7650198
- Eprint ID: 97998
- Resolver ID: CaltechAUTHORS:20190819-105323235
- Funder: Andrea and Charles Bronfman Philanthropies
- Funder: NIH (U01HG00638001)
- Funder: NIH (U01HG007417)
- Funder: NIH (X01HL134588)
- Funder: Helmsley Charitable Trust
- Funder: Centers for Common Disease Genomes Program (NHGRI)
- Funder: NIH (U24HG007497)
- Funder: A. L. Williams Professorship
- Created: 2019-08-19 (created from EPrint's datestamp field)
- Updated: 2023-06-01 (created from EPrint's last_modified field)