Overview of the gene ontology task at BioCreative IV

Creators: Mao, Yuqing; Van Auken, Kimberly

Abstract

Gene Ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven useful in assisting real-world GO curation. The shortage of sentence-level training data and of opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade, although much progress is still needed for computer-assisted GO curation. Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation.

Additional Information

© 2014 Oxford University Press. This work is written by US Government employees and is in the public domain in the US.

Received 10 February 2014; Revised 28 July 2014; Accepted 29 July 2014. Published online 25 August 2014.

The authors would like to thank Lynette Hirschman, John Wilbur, Cathy Wu, Kevin Cohen, Martin Krallinger and Thomas Wiegers from the BioCreative IV organizing committee for their support, and Judith Blake, Andrew Chatr-aryamontri, Sherri Matis, Fiona McCarthy, Sandra Orchard and Phoebe Roberts from the BioCreative IV User Advisory Group for their helpful discussions.

This research is supported by the NIH Intramural Research Program, National Library of Medicine (Y.M. and Z.L.). The BioCreative IV Workshop is funded by NSF/DBI-0850319. WormBase is funded by the National Human Genome Research Institute (NHGRI) [U41-HG002223] and the Gene Ontology Consortium by the NHGRI [U41-HG002273]. FlyBase is funded by an NHGRI/NIH grant [U41-HG000739] and the UK Medical Research Council [G1000968]. Team 238 is funded by NSF/ABI-0845523 (H.L. and D.Z.) and NIH R01LM009959A1 (H.L. and D.Z.). The SIBtex (Swiss Institute of Bioinformatics) team has been partially supported by the SNF (neXtpresso #153437) and the European Union (Khresmoi #257528).

Conflict of interest. None declared.

Attached Files

Published - Database-2014-Mao-database-bau086.pdf (496.5 kB, md5:e1b8e967146a0394c1ac291fd1cb993b)
