A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential
Abstract
The current deluge of newly identified RNA transcripts presents a singular opportunity for improved assessment of coding potential, a cornerstone of genome annotation, and for machine-driven discovery of biological knowledge. While traditional, feature-based methods for RNA classification are limited by current scientific knowledge, deep learning methods can independently discover complex biological rules in the data de novo. We trained a gated recurrent neural network (RNN) on human messenger RNA (mRNA) and long noncoding RNA (lncRNA) sequences. Our model, mRNA RNN (mRNN), surpasses state-of-the-art methods at predicting protein-coding potential despite being trained with less data and with no prior concept of what features define mRNAs. To understand what mRNN learned, we probed the network and uncovered several context-sensitive codons highly predictive of coding potential. Our results suggest that gated RNNs can learn complex and long-range patterns in full-length human transcripts, making them ideal for performing a wide range of difficult classification tasks and, most importantly, for harvesting new biological insights from the rising flood of sequencing data.
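As a rough illustration of what "a gated recurrent neural network" reading a transcript nucleotide by nucleotide can look like, here is a minimal sketch in pure Python. It is not the authors' mRNN: the class name `TinyGRUClassifier`, the hidden size of 8, the omission of bias terms, and the random untrained weights are all assumptions made for this example, and the gate equations are the standard GRU formulation rather than anything taken from the paper.

```python
import math
import random

NUCLEOTIDES = "ACGU"

def one_hot(base):
    """Encode one nucleotide as a 4-element vector; unknown bases become all zeros."""
    return [1.0 if base == n else 0.0 for n in NUCLEOTIDES]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class TinyGRUClassifier:
    """Hypothetical, untrained GRU that maps an RNA sequence to a score in (0, 1)."""

    def __init__(self, hidden=8, seed=0):
        rng = random.Random(seed)
        n_in = len(NUCLEOTIDES)
        # One input matrix W and one recurrent matrix U per gate:
        # update (z), reset (r), candidate state (c). Biases are omitted for brevity.
        self.W = {g: rand_matrix(hidden, n_in, rng) for g in "zrc"}
        self.U = {g: rand_matrix(hidden, hidden, rng) for g in "zrc"}
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
        self.hidden = hidden

    def step(self, h, x):
        """Advance the hidden state h by one nucleotide x (standard GRU update)."""
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(a + b) for a, b in zip(matvec(self.W["c"], x), matvec(self.U["c"], rh))]
        # The update gate interpolates between the old state and the candidate.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

    def coding_score(self, seq):
        """Read the whole sequence, then squash the final state to a single score."""
        h = [0.0] * self.hidden
        for base in seq.upper().replace("T", "U"):
            h = self.step(h, one_hot(base))
        return sigmoid(sum(w * hi for w, hi in zip(self.w_out, h)))

model = TinyGRUClassifier()
score = model.coding_score("AUGGCCAAGUAA")  # toy sequence, not real data
```

Because the weights are random, the score carries no biological meaning; training on labelled mRNA and lncRNA sequences would be needed before it could be read as a coding-potential estimate. The sketch only shows how a gated state lets information from early positions persist across a long sequence.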
Additional Information
© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Received April 11, 2018; Revised May 20, 2018; Editorial Decision June 07, 2018; Accepted June 15, 2018; Published: 09 July 2018.
The authors would like to thank Prof. Stephen Ramsey, Prof. Christopher K. Mathews, Prof. Liang Huang, Prof. Colin Johnson, Prof. P. Andy Karplus and Prof. Michael Freitag for feedback on the manuscript and helpful discussions. The authors thank Mike Tyka for the suggestion to use data augmentation.
Authors' contribution: S.H., R.K., E.M., A.T. and D.H. wrote the software. S.H., R.K., A.T., P.D. and D.H. did the bioinformatics analysis. R.K., D.H. and S.H. wrote the manuscript.
Funding: NIH [R56 AG053460, R21 AG052950]; Oregon State University (start-up grant). Funding for open access charge: NIH [R56 AG053460].
Conflict of interest statement: None declared.
Attached Files
Published - gky567.pdf
Submitted - 200758.1.full.pdf
Supplemental Material - gky567_supplemental_files.docx
Additional details
- PMCID: PMC6144860
- Eprint ID: 90443
- Resolver ID: CaltechAUTHORS:20181026-154742624
- Funders: NIH (R56 AG053460); NIH (R21 AG052950); Oregon State University
- Created: 2018-10-26 (created from EPrint's datestamp field)
- Updated: 2023-06-01 (created from EPrint's last_modified field)