FLIP: Benchmark tasks in fitness landscape inference for proteins

Creators: Dallago, Christian; Mou, Jody; Johnston, Kadina E.; Wittmann, Bruce J.; Bhattacharya, Nicholas; Goldman, Samuel; Madani, Ali; Yang, Kevin K.

Abstract

Machine learning could enable an unprecedented level of control in protein engineering for therapeutic and industrial applications. Critical to its use in designing proteins with desired properties, machine learning models must capture the protein sequence-function relationship, often termed the fitness landscape. Existing benchmarks like CASP or CAFA assess structure and function predictions of proteins, respectively, yet they do not target metrics relevant for protein engineering. In this work, we introduce Fitness Landscape Inference for Proteins (FLIP), a benchmark for function prediction to encourage rapid scoring of representation learning for protein engineering. Our curated tasks, baselines, and metrics probe model generalization in settings relevant for protein engineering, e.g., low-resource and extrapolative. Currently, FLIP encompasses experimental data across adeno-associated virus stability for gene therapy, protein domain B1 stability and immunoglobulin binding, and thermostability from multiple protein families. In order to enable ease of use and future expansion to new tasks, all data are presented in a standard format. FLIP scripts and data are freely accessible at https://benchmark.protein.properties.

Additional Information

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. Version 1 - November 11, 2021; Version 2 - January 19, 2022.

The authors thank Jeffrey Spencer, Sam Sinai, Sam Bowman, Roshan Rao and Debora Marks for ideas and discussions that helped us improve our work. The authors would also like to thank Helix and Murphy for careful attention to the manuscript.

C.D. acknowledges support from the Bundesministerium für Bildung und Forschung (BMBF) – Project numbers: 01IS17049 and 031L0168. K.E.J. and B.J.W. acknowledge the NSF Division of Chemical, Bioengineering, Environmental and Transport Systems (1937902). N.B. was supported in part by NIH grant R35-GM134922 and by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. S.G. thanks the MIT Machine Learning for Pharmaceutical Discovery and Synthesis Consortium for supporting this work.

Competing Interest Statement: KKY was previously employed by Generate Biomedicines.

Attached Files

Submitted - 2021.11.09.467890v2.full.pdf

Supplemental Material - media-1.pdf

Files (4.5 MB)

Name	Size
media-1.pdf md5:ebe41bf9230fee679ec79ba4df2b3b1a	1.5 MB
2021.11.09.467890v2.full.pdf md5:9fadc679276a70d3aafa91696624f08a	3.0 MB
