Accelerated variant curation from scientific literature using biomedical text mining
Abstract
Biological databases collect and standardize data through biocuration. Even though major model organism databases have adopted some automation of curation methods, a large portion of biocuration is still performed manually. To speed up the extraction of the genomic positions of variants, we have developed a hybrid approach that combines regular expressions, Named Entity Recognition based on BERT (Bidirectional Encoder Representations from Transformers) and bag-of-words to extract variant genomic locations from C. elegans papers for WormBase. Our model has a precision of 82.59% for the gene-mutation matches tested on extracted text from 100 papers, and even recovers some data not discovered during manual curation. Code at: https://github.com/WormBase/genomic-info-from-papers
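To make the hybrid idea in the abstract concrete, here is a minimal sketch of how a regular-expression pass and a bag-of-words filter might be combined to propose gene-mutation pairs from a sentence. It is not the authors' implementation (see the linked repository for that), and it omits the BERT-based Named Entity Recognition step entirely. The patterns, the keyword set, the function name `candidate_pairs`, and the example gene `abc-1` with mutation `G100R` are all assumptions made up for illustration.

```python
import re

# Assumption: one-letter amino-acid substitutions such as "G100R".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PROTEIN_SUBSTITUTION = re.compile(rf"\b[{AMINO_ACIDS}]\d+[{AMINO_ACIDS}]\b")

# Assumption: a simplified C. elegans gene-name shape (3-4 letters, hyphen, number).
GENE_NAME = re.compile(r"\b[a-z]{3,4}-\d+\b")

# Hypothetical bag of context words suggesting the sentence describes a variant.
CONTEXT_WORDS = {"mutation", "allele", "substitution", "variant", "missense"}


def candidate_pairs(sentence):
    """Return (gene, mutation) pairs found in a sentence that has variant context."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    if not tokens & CONTEXT_WORDS:
        # Bag-of-words filter: skip sentences with no variant-related vocabulary.
        return []
    genes = GENE_NAME.findall(sentence)
    mutations = PROTEIN_SUBSTITUTION.findall(sentence)
    # Pair every gene mention with every mutation mention in the same sentence.
    return [(gene, mutation) for gene in genes for mutation in mutations]


# Made-up sentence, not taken from any curated paper.
pairs = candidate_pairs("The abc-1 allele carries a G100R missense substitution.")
```

Pairing every gene with every mutation in a sentence over-generates on purpose; in a real pipeline the candidates would still need validation, for example against reference sequences or by a curator.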
Additional Information
© 2022 by the authors. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Received: 2/8/2022. Revision Received: 5/19/2022. Accepted: 6/1/2022. Published: 6/1/2022.
Funding for WormBase is from US National Human Genome Research Institute [U24 HG002223]; UK Medical Research Council [MR/S000453/1]; UK Biotechnology and Biological Sciences Research Council [BB/P024610/1, BB/P024602/1]. Rishab Mallick was a participant in the Google Summer of Code 2021 program.
Author Contributions. Rishab Mallick: Writing - original draft, Methodology, Investigation, Visualization. Valerio Arnaboldi: Conceptualization, Supervision, Software, Writing - review & editing. Paul Davis: Data curation, Validation. Stavros Diamantakis: Data curation, Validation. Magdalena Zarowiecki: Conceptualization, Data curation, Funding acquisition, Project administration, Supervision, Writing - review & editing. Kevin Howe: Funding acquisition, Supervision, Writing - review & editing.
Attached Files
Published - micropub-biology-000578.pdf
Files
Checksum | Size |
---|---|
md5:840c193c001c78c1da03cae5ec546394 | 357.0 kB |
Additional details
- PMCID: PMC9160977
- Eprint ID: 115036
- Resolver ID: CaltechAUTHORS:20220606-736182000
- NIH: U24 HG002223
- Medical Research Council (UK): MR/S000453/1
- Biotechnology and Biological Sciences Research Council (BBSRC): BB/P024610/1
- Biotechnology and Biological Sciences Research Council (BBSRC): BB/P024602/1
- Google Summer of Code 2021
- Created: 2022-06-07 (created from EPrint's datestamp field)
- Updated: 2023-06-01 (created from EPrint's last_modified field)
- Caltech groups: WormBase