Informed training set design enables efficient machine learning-assisted directed protein evolution
Abstract
Directed evolution of proteins often involves a greedy optimization in which the mutation in the highest-fitness variant identified in each round of single-site mutagenesis is fixed. The efficiency of such a single-step greedy walk depends on the order in which beneficial mutations are identified—the process is path dependent. Here, we investigate and optimize a path-independent machine learning-assisted directed evolution (MLDE) protocol that allows in silico screening of full combinatorial libraries. In particular, we evaluate the importance of different protein encoding strategies, training procedures, models, and training set design strategies on MLDE outcome, finding the most important consideration to be the implementation of strategies that reduce inclusion of minimally informative "holes" (protein variants with zero or extremely low fitness) in training data. When applied to an epistatic, hole-filled, four-site combinatorial fitness landscape, our optimized protocol achieved the global fitness maximum up to 81-fold more frequently than single-step greedy optimization. A record of this paper's transparent peer review process is included in the supplemental information.
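The path dependence of a single-step greedy walk can be illustrated with a minimal sketch. This is not the authors' code or data. It assumes a randomly generated toy landscape over a hypothetical four-letter alphabet, with a hypothetical `hole_fraction` controlling how many variants have zero fitness, and it simplifies each round of single-site mutagenesis to visiting one site at a time in a fixed order. All function and parameter names are illustrative.

```python
import itertools
import random

ALPHABET = "ACDE"  # hypothetical reduced alphabet, keeps the toy library small
N_SITES = 4        # four-site combinatorial library, as in the abstract


def make_toy_landscape(seed=0, hole_fraction=0.6):
    """Build a random toy landscape (an assumption, not real data).

    A fraction of variants are zero-fitness "holes"; the rest get a
    random fitness in [0, 1).
    """
    rng = random.Random(seed)
    landscape = {}
    for variant in itertools.product(ALPHABET, repeat=N_SITES):
        is_hole = rng.random() < hole_fraction
        landscape["".join(variant)] = 0.0 if is_hole else rng.random()
    return landscape


def greedy_walk(landscape, start, site_order):
    """Single-step greedy walk over the sites in the given order.

    At each site, every substitution (including the current residue)
    is screened and the highest-fitness one is fixed before moving on.
    """
    current = start
    for site in site_order:
        candidates = [
            current[:site] + residue + current[site + 1:]
            for residue in ALPHABET
        ]
        current = max(candidates, key=landscape.__getitem__)
    return current, landscape[current]


if __name__ == "__main__":
    landscape = make_toy_landscape(seed=1)
    start = "AAAA"
    global_best = max(landscape.values())
    # Different site orders can end at different variants: path dependence.
    endpoints = {
        order: greedy_walk(landscape, start, order)
        for order in itertools.permutations(range(N_SITES))
    }
    hits = sum(fit == global_best for _, fit in endpoints.values())
    print(f"distinct endpoints: {len({v for v, _ in endpoints.values()})}")
    print(f"orders reaching the global maximum: {hits} of {len(endpoints)}")
```

Running the script counts how many of the 24 site orders end at the global maximum of the toy landscape. A path-independent approach would instead score every variant in the combinatorial library in silico and then choose which to test.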
Additional Information
© 2021 Elsevier Inc. Received 15 December 2020, Revised 6 May 2021, Accepted 26 July 2021, Available online 19 August 2021.
Acknowledgments: The authors thank Sabine Brinkmann-Chen, Patrick Almhjell, and Lucas Schaus for helpful discussion and critical reading of the manuscript, Zachary Wu, Kadina Johnston, and Amir Motmaen for helpful discussion, Suresh Guptha for assistance with computational infrastructure development and maintenance, and Paul Chang for assistance with Triad calculations. Additionally, the authors thank NVIDIA Corporation for donation of two Titan V GPUs used in this work and Amazon.com for donation of Amazon Web Services (AWS) computing credits. This work was supported by the NSF Division of Chemical, Bioengineering, Environmental and Transport Systems (CBET 1937902) and by an Amgen Chem-Bio-Engineering Award (CBEA).
Author contributions: Conceptualization, B.J.W., Y.Y., and F.H.A.; methodology, B.J.W. and Y.Y.; software, B.J.W.; validation, B.J.W.; formal analysis, B.J.W.; investigation, B.J.W.; writing – original draft, B.J.W., Y.Y., and F.H.A.; writing – review & editing, B.J.W., Y.Y., and F.H.A.; visualization, B.J.W.
The authors declare no competing interests.
Data and code availability: Data needed to replicate simulations have been deposited at Caltech Data and are publicly available as of the date of publication. DOIs are listed in the key resources table. The raw simulation data reported in this study cannot be deposited in a public repository because it is multiple terabytes in size. To request access, contact Bruce Wittmann at bwittman@caltech.edu. In addition, summary statistics describing these raw data have been deposited at Caltech Data and are publicly available as of the date of publication. DOIs are listed in the key resources table. This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the key resources table.
All original code has been deposited at Caltech Data and is publicly available as of the date of publication. DOIs are listed in the key resources table. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Attached Files
Submitted - 2020.12.04.408955v1.full.pdf
Supplemental Material - 1-s2.0-S2405471221002866-mmc1.pdf
Supplemental Material - 1-s2.0-S2405471221002866-mmc2.csv
Supplemental Material - 1-s2.0-S2405471221002866-mmc3.csv
Supplemental Material - 1-s2.0-S2405471221002866-mmc4.csv
Supplemental Material - 1-s2.0-S2405471221002866-mmc5.pdf
Files
MD5 checksum | Size |
---|---|
md5:b0c186df5e694f56c6a3251b1661dd81 | 765.4 kB |
md5:9d2f4d652b78629581ab4da84e697ca2 | 2.3 MB |
md5:028eebfb0c46f08194cecccc365a1755 | 103.3 kB |
md5:86b3c133cafe5cd9313e94537e8ca0f2 | 2.3 MB |
md5:d6ce32a11ae49c3efe6accf2b6b72c57 | 12.8 kB |
md5:503acf4a1da6e18e2f8071976d5b6856 | 29.8 kB |
Additional details
- Alternative title: Machine Learning-Assisted Directed Evolution Navigates a Combinatorial Epistatic Fitness Landscape with Minimal Screening Burden
- Eprint ID: 106948
- DOI: 10.1016/j.cels.2021.07.008
- Resolver ID: CaltechAUTHORS:20201207-131007947
- Funders: NVIDIA Corporation; Amazon Web Services; NSF (CBET-1937902); Amgen
- Created: 2020-12-07 (Created from EPrint's datestamp field)
- Updated: 2021-11-18 (Created from EPrint's last_modified field)
- Caltech groups: Division of Biology and Biological Engineering (BBE)