Strategies and Tools for Machine Learning-Assisted Protein Engineering

Citation

Wittmann, Bruce James (2022) Strategies and Tools for Machine Learning-Assisted Protein Engineering. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/azzt-0q97. https://resolver.caltech.edu/CaltechTHESIS:05262022-234214451

Abstract

Proteins perform critical roles in a growing list of human-devised applications, and as demands for new applications arise, new proteins must be engineered to meet them. Machine learning-assisted protein engineering (MLPE) has recently arisen as a new philosophy of protein engineering, promising to overcome many of the limitations of existing engineering strategies. Despite its promise, however, as a relatively new approach to protein engineering, MLPE faces many challenges that hinder its routine application. This thesis is focused on addressing a number of them. Chapter 1 provides a theoretical overview of protein engineering, introduces the core steps of a typical MLPE pipeline, and discusses the challenges that currently hinder MLPE’s advancement. This chapter is written to be accessible to all members of the highly multidisciplinary audience that either use or develop MLPE tools, in turn providing a resource that eliminates the steep barrier to entry that can hinder broader participation in the field. Chapter 2 provides a solution to the challenge of applying MLPE to proteins whose fitness landscapes are dominated by “holes” (protein variants with zero or extremely low fitness). Using my development of the strategy “focused training machine learning-assisted directed evolution (ftMLDE)” as an example, I demonstrate how auxiliary information from protein sequence and structure can be used to navigate landscapes despite holes, in turn dramatically improving the efficiency of MLPE. Chapter 3 explores strategies for reducing the amount of sequence-fitness data needed for building MLPE models. Specifically, I detail the motivation behind and development of a new model designed to augment limited protein sequence-fitness datasets with information extracted from raw protein sequence and structure data. 
Finally, Chapter 4 introduces “every variant sequencing” (evSeq), a collection of tools and protocols that enables extremely low-cost, routine collection of large protein sequence-fitness datasets. Not only does this technology drastically improve the financial feasibility of numerous MLPE applications, but it also potentiates the construction of a massive database of diverse protein sequence-fitness data, the likes of which would revolutionize our ability to engineer proteins with data-driven methods. Overall, the work described in this thesis advances both our understanding of MLPE and our ability to engineer proteins using it.

Item Type:

Thesis (Dissertation (Ph.D.))

Subject Keywords:

Machine Learning, Protein Engineering, Sequencing, Natural Language, Directed Evolution

Degree Grantor:

California Institute of Technology

Division:

Biology and Biological Engineering

Major Option:

Bioengineering

Thesis Availability:

Public (worldwide access)

Research Advisor(s):

Arnold, Frances Hamilton

Thesis Committee:

Pachter, Lior S. (chair)
Reisman, Sarah E.
Mayo, Stephen L.
Arnold, Frances Hamilton

Defense Date:

26 May 2022

Funders:

Funding Agency	Grant Number
NSF Division of Chemical, Bioengineering, Environmental and Transport Systems	CBET 1937902
Amgen Chem-Bio-Engineering Award	CBEA
U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences	DE-SC0022218
Camille and Henry Dreyfus Foundation	ML-20-194
Caltech Carver Mead New Adventure Seed Fund	UNSPECIFIED

Record Number:

CaltechTHESIS:05262022-234214451

Persistent URL:

https://resolver.caltech.edu/CaltechTHESIS:05262022-234214451

DOI:

10.7907/azzt-0q97

Related URLs:

URL	URL Type	Description
https://doi.org/10.1016/j.sbi.2021.01.008	DOI	Article adapted for chapters 1 and 3
https://doi.org/10.1016/j.cels.2021.07.008	DOI	Article adapted for chapter 2
https://doi.org/10.1021/acssynbio.1c00592	DOI	Article adapted for chapter 4
https://doi.org/10.1073/pnas.1901979116	DOI	First published work not included in thesis
https://doi.org/10.1021/acscatal.0c01888	DOI	Second published work not included in thesis
https://doi.org/10.1101/2021.11.09.467890	DOI	Third published work not included in thesis

ORCID:

Author	ORCID
Wittmann, Bruce James	0000-0001-8144-9157

Default Usage Policy:

No commercial reproduction, distribution, display or performance rights in this work are provided.

ID Code:

14631

Collection:

CaltechTHESIS

Deposited By:

Bruce Wittmann

Deposited On:

06 Jun 2022 17:59

Last Modified:

08 Nov 2023 00:11

Thesis Files

PDF (Thesis) - Final Version. See Usage Policy. 10 MB
MS Excel (Data S1.csv) - Supplemental Material. See Usage Policy. 12 kB
MS Excel (Data S2.csv) - Supplemental Material. See Usage Policy. 103 kB
MS Excel (Data S3.csv) - Supplemental Material. See Usage Policy. 29 kB
