Published November 17, 2021 | Supplemental Material + Submitted
Journal Article Open

Informed training set design enables efficient machine learning-assisted directed protein evolution

Abstract

Directed evolution of proteins often involves a greedy optimization in which the mutation in the highest-fitness variant identified in each round of single-site mutagenesis is fixed. The efficiency of such a single-step greedy walk depends on the order in which beneficial mutations are identified—the process is path dependent. Here, we investigate and optimize a path-independent machine learning-assisted directed evolution (MLDE) protocol that allows in silico screening of full combinatorial libraries. In particular, we evaluate the importance of different protein encoding strategies, training procedures, models, and training set design strategies on MLDE outcome, finding the most important consideration to be the implementation of strategies that reduce inclusion of minimally informative "holes" (protein variants with zero or extremely low fitness) in training data. When applied to an epistatic, hole-filled, four-site combinatorial fitness landscape, our optimized protocol achieved the global fitness maximum up to 81-fold more frequently than single-step greedy optimization. A record of this paper's transparent peer review process is included in the supplemental information.
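The path dependence described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the three-site, two-letter landscape `TOY_FITNESS`, its fitness values, and the helper `greedy_walk` are invented for illustration and are not taken from the paper's data or code. The walk is also a simplified reading of single-step greedy optimization: each round screens every residue at every unfixed position and fixes the best one. The exhaustive lookup at the end stands in for ranking a full combinatorial library; an actual MLDE run would rank variants by model predictions, not by true fitness.

```python
# Hypothetical toy landscape (not from the paper): 3 sites, alphabet {A, B}.
# Zero-fitness entries play the role of "holes"; the values are epistatic,
# so the effect of a mutation depends on the residues at the other sites.
TOY_FITNESS = {
    "AAA": 1.0, "BAA": 2.0, "ABA": 0.0, "AAB": 0.0,
    "BBA": 2.5, "BAB": 0.5, "ABB": 5.0, "BBB": 3.0,
}
ALPHABET = "AB"


def greedy_walk(start, fitness, alphabet=ALPHABET):
    """Simplified single-step greedy walk (an assumption, not the paper's code).

    Each round screens all residues at all unfixed positions, then fixes
    the position and residue that give the highest-fitness variant.
    """
    current = start
    unfixed = set(range(len(start)))
    while unfixed:
        # One round of simulated single-site mutagenesis.
        candidates = [
            (fitness[current[:pos] + res + current[pos + 1:]], pos, res)
            for pos in sorted(unfixed)
            for res in alphabet
        ]
        # Ties are broken deterministically by tuple ordering.
        _, pos, res = max(candidates)
        current = current[:pos] + res + current[pos + 1:]
        unfixed.remove(pos)
    return current


# The end point depends on the starting variant: the walk is path dependent.
end_from_aaa = greedy_walk("AAA", TOY_FITNESS)
end_from_aab = greedy_walk("AAB", TOY_FITNESS)

# Ranking the full combinatorial library does not depend on a starting point.
global_best = max(TOY_FITNESS, key=TOY_FITNESS.get)

print(end_from_aaa, end_from_aab, global_best)
```

In this toy case the walk from `AAA` fixes `B` at the first site in round one and can no longer reach the global maximum, while the walk from `AAB` does reach it.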

Additional Information

© 2021 Elsevier Inc. Received 15 December 2020, Revised 6 May 2021, Accepted 26 July 2021, Available online 19 August 2021.

The authors thank Sabine Brinkmann-Chen, Patrick Almhjell, and Lucas Schaus for helpful discussion and critical reading of the manuscript, Zachary Wu, Kadina Johnston, and Amir Motmaen for helpful discussion, Suresh Guptha for assistance with computational infrastructure development and maintenance, and Paul Chang for assistance with Triad calculations. Additionally, the authors thank NVIDIA Corporation for donation of two Titan V GPUs used in this work and Amazon.com for donation of Amazon web services (AWS) computing credits. This work was supported by the NSF Division of Chemical, Bioengineering, Environmental and Transport Systems (CBET 1937902) and by an Amgen Chem-Bio-Engineering Award (CBEA).

Author contributions: Conceptualization, B.J.W., Y.Y., and F.H.A.; methodology, B.J.W. and Y.Y.; software, B.J.W.; validation, B.J.W.; formal analysis, B.J.W.; investigation, B.J.W.; writing – original draft, B.J.W., Y.Y., and F.H.A.; writing – review & editing, B.J.W., Y.Y., and F.H.A.; visualization, B.J.W.

The authors declare no competing interests.

Data and code availability: Data needed to replicate simulations have been deposited at Caltech Data and are publicly available as of the date of publication. DOIs are listed in the key resources table. The raw simulation data reported in this study cannot be deposited in a public repository because it is multiple terabytes in size. To request access, contact Bruce Wittmann at bwittman@caltech.edu. In addition, summary statistics describing these raw data have been deposited at Caltech Data and are publicly available as of the date of publication. DOIs are listed in the key resources table. This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the key resources table. All original code has been deposited at Caltech Data and is publicly available as of the date of publication. DOIs are listed in the key resources table. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Attached Files

Submitted - 2020.12.04.408955v1.full.pdf

Supplemental Material - 1-s2.0-S2405471221002866-mmc1.pdf

Supplemental Material - 1-s2.0-S2405471221002866-mmc2.csv

Supplemental Material - 1-s2.0-S2405471221002866-mmc3.csv

Supplemental Material - 1-s2.0-S2405471221002866-mmc4.csv

Supplemental Material - 1-s2.0-S2405471221002866-mmc5.pdf

Files

Files (5.5 MB)

md5:b0c186df5e694f56c6a3251b1661dd81 (765.4 kB)
md5:9d2f4d652b78629581ab4da84e697ca2 (2.3 MB)
md5:028eebfb0c46f08194cecccc365a1755 (103.3 kB)
md5:86b3c133cafe5cd9313e94537e8ca0f2 (2.3 MB)
md5:d6ce32a11ae49c3efe6accf2b6b72c57 (12.8 kB)
md5:503acf4a1da6e18e2f8071976d5b6856 (29.8 kB)

Additional details

Created: September 22, 2023
Modified: December 22, 2023