BC4GO: a full-text corpus for the BioCreative IV GO task

Creators: Van Auken, Kimberly; Schaeffer, Mary L.; McQuilton, Peter; Laulederkind, Stanley J. F.; Li, Donghui; Wang, Shur-Jen; Hayman, G. Thomas; Tweedie, Susan; Arighi, Cecilia N.; Done, James; Müller, Hans-Michael; Sternberg, Paul W.; Mao, Yuqing; Wei, Chih-Hsuan; Lu, Zhiyong

Abstract

Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered accuracy comparable with that of human curators. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measures. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ∼10% of relevant evidence sentences and 30% of distinct GO terms, while the Results/Experiment section has nearly 60% of relevant sentences and >70% of GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need to use full-text articles for text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community.
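The abstract reports IAA as F1-measures under strict and relaxed matching but does not define the two criteria. The sketch below shows one way such a pairwise F1 could be computed. It assumes that strict matching requires both annotators to pick the same sentence, and that relaxed matching accepts a sentence within a small window of positions in the same article. The `Sentence` type, the `window` parameter, the `pairwise_f1` function and the article identifier `PMC1` are all hypothetical names introduced for illustration, not taken from the corpus.

```python
from typing import Set, Tuple

# Hypothetical representation of one evidence sentence:
# (article identifier, sentence position within the article).
Sentence = Tuple[str, int]


def pairwise_f1(a: Set[Sentence], b: Set[Sentence], window: int = 0) -> float:
    """F1 between two annotators, treating `a` as reference and `b` as response.

    window=0 is the assumed "strict" criterion (identical sentence).
    window>0 is the assumed "relaxed" criterion (a sentence counts as
    matched if the other annotator chose one within `window` positions
    in the same article).
    """
    def matched(sentence: Sentence, other: Set[Sentence]) -> bool:
        doc, idx = sentence
        return any(d == doc and abs(i - idx) <= window for d, i in other)

    if not a or not b:
        return 0.0
    precision = sum(matched(s, a) for s in b) / len(b)
    recall = sum(matched(s, b) for s in a) / len(a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy selections from two annotators on one made-up article.
annotator_1 = {("PMC1", 3), ("PMC1", 10), ("PMC1", 20)}
annotator_2 = {("PMC1", 3), ("PMC1", 11)}

strict_f1 = pairwise_f1(annotator_1, annotator_2, window=0)
relaxed_f1 = pairwise_f1(annotator_1, annotator_2, window=1)
```

In this toy case the strict score counts only the shared sentence 3, while the relaxed score also credits sentences 10 and 11 as a match, which illustrates how a relaxed criterion can yield a much higher agreement figure than a strict one.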

Additional Information

© 2014 Oxford University Press. This work is written by US Government employees and is in the public domain in the US.

Received 1 February 2014; Revised 1 July 2014; Accepted 3 July 2014.

We would like to thank Don Comeau, Rezarta Dogan and John Wilbur for general discussion and technical assistance in using BioC, and in particular Don Comeau for providing us with source PMC articles in the BioC XML format. We also thank Lynette Hirschman, Cathy Wu, Kevin Cohen, Martin Krallinger and Thomas Wiegers from the BioCreative IV organizing committee for their support, and Judith Blake, Andrew Chatr-aryamontri, Sherri Matis, Fiona McCarthy, Sandra Orchard and Phoebe Roberts from the BioCreative IV User Advisory Group for their helpful discussions.

Funding: Intramural Research Program of the NIH, National Library of Medicine (to C.W., Y.M. and Z.L.), the USDA ARS (to M.L.S.), the National Human Genome Research Institute at the US National Institutes of Health (# HG004090, # HG002223 and # HG002273) and the National Science Foundation (ABI-1062520, ABI-1147029 and DBI-0850319).

Conflict of interest: None declared.

Attached Files

Published - Database-2014-Van_Auken-database_bau074.pdf

Files

Database-2014-Van_Auken-database_bau074.pdf (958.8 kB), md5:f5850d89ab1917f0cee55edb5bb59dc4
