Published June 1, 2022
Journal Article Open

Accelerated variant curation from scientific literature using biomedical text mining

Abstract

Biological databases collect and standardize data through biocuration. Even though major model organism databases have adopted some automation of curation methods, a large portion of biocuration is still performed manually. To speed up the extraction of the genomic positions of variants, we have developed a hybrid approach that combines regular expressions, Named Entity Recognition based on BERT (Bidirectional Encoder Representations from Transformers) and bag-of-words to extract variant genomic locations from C. elegans papers for WormBase. Our model has a precision of 82.59% for the gene-mutation matches tested on extracted text from 100 papers, and even recovers some data not discovered during manual curation. Code at: https://github.com/WormBase/genomic-info-from-papers
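The regular-expression part of such a hybrid approach can be illustrated with a minimal sketch. This is not the authors' pipeline (see the linked repository for that). The patterns, the sentence-level pairing rule, and the names `GENE_RE`, `PROTEIN_MUT_RE`, `DNA_MUT_RE`, `extract_pairs`, and `precision` are all assumptions made for illustration, and the BERT-based Named Entity Recognition and bag-of-words steps are omitted.

```python
import re

# Hypothetical patterns for illustration only; not taken from the WormBase code.
GENE_RE = re.compile(r"\b[a-z]{3,4}-\d+\b")  # gene-like names, e.g. daf-2, unc-13
AMINO = "ACDEFGHIKLMNPQRSTVWY"
PROTEIN_MUT_RE = re.compile(rf"\b[{AMINO}]\d+[{AMINO}]\b")  # e.g. G383E
DNA_MUT_RE = re.compile(r"\bc\.\d+[ACGT]>[ACGT]")  # e.g. c.1148G>A


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_pairs(text):
    """Pair every gene mention with every mutation mention in the same sentence."""
    pairs = []
    for sentence in split_sentences(text):
        genes = GENE_RE.findall(sentence)
        mutations = PROTEIN_MUT_RE.findall(sentence) + DNA_MUT_RE.findall(sentence)
        for gene in genes:
            for mutation in mutations:
                pair = (gene, mutation)
                if pair not in pairs:  # keep first occurrence, drop repeats
                    pairs.append(pair)
    return pairs


def precision(predicted, verified):
    """Fraction of predicted pairs that appear in the verified (curated) set."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(verified)) / len(predicted)


# Invented example text, not from any paper.
sample = "The daf-2 allele carries a G383E substitution. We also saw c.1148G>A in unc-13."
found = extract_pairs(sample)
print(found)
print(precision(found, [("daf-2", "G383E")]))
```

Sentence-level co-occurrence is the simplest possible pairing rule and will over-generate when several genes share a sentence. The naive splitter also breaks on abbreviations such as "C. elegans", so real text would need a more careful tokenizer.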

Additional Information

© 2022 by the authors. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Received: 2/8/2022. Revision Received: 5/19/2022. Accepted: 6/1/2022. Published: 6/1/2022.

Funding for WormBase is from US National Human Genome Research Institute [U24 HG002223]; UK Medical Research Council [MR/S000453/1]; UK Biotechnology and Biological Sciences Research Council [BB/P024610/1, BB/P024602/1]. Rishab Mallick was a participant in the Google Summer of Code 2021 program.

Author Contributions.
Rishab Mallick: Writing - original draft, Methodology, Investigation, Visualization.
Valerio Arnaboldi: Conceptualization, Supervision, Software, Writing - review & editing.
Paul Davis: Data curation, Validation.
Stavros Diamantakis: Data curation, Validation.
Magdalena Zarowiecki: Conceptualization, Data curation, Funding acquisition, Project administration, Supervision, Writing - review & editing.
Kevin Howe: Funding acquisition, Supervision, Writing - review & editing.

Attached Files

Published - micropub-biology-000578.pdf

Files

micropub-biology-000578.pdf (357.0 kB)
md5:840c193c001c78c1da03cae5ec546394

Additional details

Created: August 20, 2023
Modified: October 23, 2023