Decoding the Past

Citation

Jain, Siddharth (2019) Decoding the Past. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/K286-5N63. https://resolver.caltech.edu/CaltechTHESIS:04032019-102853075

Abstract

The human genome is continuously evolving; hence, a sequenced genome is a snapshot in time of this evolving entity. Over time, the genome accumulates mutations that can be associated with different phenotypes, such as physical traits and diseases. Underlying mutation accumulation is an evolution channel, which is controlled by hereditary, environmental, and stochastic factors (the term channel is motivated by the notion of a communication channel introduced by Shannon [1] in 1948, which started the field of Information Theory). The premise of this thesis is to understand the human genome using an information-theoretic framework. In particular, it focuses on: (i) the analysis and characterization of the evolution channel using measures of capacity, expressiveness, evolution distance, and uniqueness of ancestry, and uses these insights for (ii) the design of error-correcting codes for DNA storage, (iii) explaining inversion symmetry in the genome, and (iv) cancer classification.

The mutational events characterizing this evolution channel can be divided into two categories, namely point mutations and duplications. While evolution through point mutations is unconstrained, giving rise to combinatorially many possibilities of what could have happened in the past, evolution through duplications adds constraints that limit the number of those possibilities. Further, more than 50% of the genome has been observed to consist of repeated sequences. We focus on a highly constrained form of duplication, known as tandem duplication, in order to understand the limits of evolution by duplication. Our sequence evolution model consists of a starting sequence, called the seed, and a set of tandem duplication rules. We find limits on the diversity of sequences that can be generated by tandem duplications using measures of capacity and expressiveness. Additionally, we calculate bounds on the duplication distance, which is used to measure the timing of generation by these duplications. We also ask questions about the uniqueness of the seed for a given sequence and completely characterize the duplication length sets for which the seed is unique or non-unique. These insights also led us to design error-correcting codes for any number of tandem duplication errors, which are useful for DNA-storage-based applications. For uniform duplication length and for duplication length bounded by 2, our designed codes achieve channel capacity. We also define and measure uncertainty in decoding when the duplication channel is misinformed. Moreover, we add substitutions to our tandem duplication model and calculate sequence generation diversity for a given budget of substitutions.
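The tandem duplication model described above can be illustrated with a small sketch. This is not the thesis's code; the function names (`tandem_duplicate`, `descendants`, `roots`) and the brute-force search are assumptions made for illustration, and they are practical only for very short sequences. A tandem duplication of length k copies a block of k symbols and inserts the copy immediately after the original; undoing such duplications until none remain yields the candidate seeds of a sequence.

```python
from __future__ import annotations


def tandem_duplicate(seq: str, start: int, length: int) -> str:
    """Copy seq[start:start+length] and insert the copy right after it."""
    if length <= 0 or start < 0 or start + length > len(seq):
        raise ValueError("duplicated block must lie inside the sequence")
    block = seq[start:start + length]
    return seq[:start + length] + block + seq[start + length:]


def descendants(seed: str, lengths: set[int], steps: int) -> set[str]:
    """All sequences reachable from seed with at most `steps` duplications."""
    reached = {seed}
    frontier = {seed}
    for _ in range(steps):
        produced = set()
        for s in frontier:
            for k in lengths:
                for i in range(len(s) - k + 1):
                    produced.add(tandem_duplicate(s, i, k))
        frontier = produced - reached
        reached |= frontier
    return reached


def one_step_ancestors(seq: str, lengths: set[int]) -> set[str]:
    """Sequences obtained by removing one copy of an adjacent repeated block."""
    found = set()
    for k in lengths:
        for i in range(len(seq) - 2 * k + 1):
            if seq[i:i + k] == seq[i + k:i + 2 * k]:
                found.add(seq[:i + k] + seq[i + 2 * k:])
    return found


def roots(seq: str, lengths: set[int]) -> set[str]:
    """Irreducible sequences reached by undoing duplications (candidate seeds)."""
    seen = {seq}
    stack = [seq]
    irreducible = set()
    while stack:
        current = stack.pop()
        previous = one_step_ancestors(current, lengths)
        if not previous:
            irreducible.add(current)
        for p in previous:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return irreducible
```

For example, `tandem_duplicate("ACGT", 1, 2)` duplicates the block `CG`, and `roots("AACCC", {1})` returns a single seed because only length-1 duplications are undone. Whether `roots` returns one sequence or several for a given set of lengths corresponds to the seed-uniqueness question the abstract raises.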

We also use our duplication model to explain the inversion symmetry observed in the genomes of many species. This inversion symmetry is popularly known as the 2nd Chargaff Rule, according to which, in single-stranded DNA, the frequency of a k-mer is almost the same as the frequency of its reverse complement. The insights gained from these problems led us to investigate the tandem repeat regions in the genome. Tandem repeat regions in the genome can be traced back in time algorithmically to make inferences about the effect of hereditary, environmental, and stochastic factors on the mutation rate of the genome. By inferring the evolutionary history of the tandem repeat regions, we show how this knowledge can be used to make predictions about the risk of incurring a mutation-based disease, specifically cancer. More precisely, we introduce the concept of mutation profiles that are computed without any comparative analysis, but instead by analyzing the short tandem repeat regions in a single healthy genome and capturing information about the individual's evolution channel. Using gradient boosting on data from more than 5,000 TCGA (The Cancer Genome Atlas) cancer patients, we demonstrate that these mutation profiles can accurately distinguish between patients with various types of cancer. For example, the pairwise validation accuracy of the classifier between PAAD (pancreas) patients and GBM (brain) patients is 93%. Our results show that healthy unaffected cells still contain a cancer-specific signal, which opens the possibility of cancer prediction from a healthy genome.
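The 2nd Chargaff Rule mentioned above compares the frequency of each k-mer on a single strand with the frequency of its reverse complement. The following is a minimal sketch of such a comparison, not the analysis used in the thesis; the function names and the normalized asymmetry score are assumptions introduced here, and the input is assumed to contain only the letters A, C, G, and T.

```python
from collections import Counter

# Translation table for complementing bases (assumes an A/C/G/T alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the result."""
    return kmer.translate(_COMPLEMENT)[::-1]


def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping k-mers on one strand."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def inversion_asymmetry(seq: str, k: int) -> float:
    """Hypothetical score in [0, 1]: 0 means every k-mer occurs exactly as
    often as its reverse complement, 1 means no reverse complement occurs."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Include reverse complements that never occur so each pair is seen twice.
    keys = set(counts) | {reverse_complement(m) for m in counts}
    gap = sum(abs(counts[m] - counts[reverse_complement(m)]) for m in keys)
    return gap / (2 * total)
```

A sequence that obeys the rule closely for a given k would give a score near 0. The score is only one of several possible ways to quantify the symmetry.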

Item Type:

Thesis (Dissertation (Ph.D.))

Subject Keywords:

DNA, Information Theory, Constrained Systems, Tandem Duplications, Tandem Repeats, DNA Storage, Cancer Classification, Chargaff Rule

Degree Grantor:

California Institute of Technology

Division:

Engineering and Applied Science

Major Option:

Electrical Engineering

Thesis Availability:

Public (worldwide access)

Research Advisor(s):

Bruck, Jehoshua

Thesis Committee:

Vaidyanathan, P. P. (chair)
Bruck, Jehoshua
Hassibi, Babak
Winfree, Erik
Schwartz, Moshe

Defense Date:

1 April 2019

Non-Caltech Author Email:

sid496 (AT) gmail.com

Record Number:

CaltechTHESIS:04032019-102853075

Persistent URL:

https://resolver.caltech.edu/CaltechTHESIS:04032019-102853075

DOI:

10.7907/K286-5N63

Related URLs:

URL	URL Type	Description
https://doi.org/10.1109/TIT.2017.2728079	DOI	Article Adapted for Ch. 2
https://doi.org/10.1109/TIT.2017.2730864	DOI	Article Adapted for Ch. 3
https://doi.org/10.1109/TIT.2017.2688361	DOI	Article Adapted for Ch. 4
https://doi.org/10.1109/ISIT.2017.8007104	DOI	Article Adapted for Ch. 4 and 5
https://doi.org/10.1109/ISIT.2018.8437526	DOI	Article Adapted for Ch. 6
https://doi.org/10.1101/517839	DOI	Article Adapted for Ch. 7

ORCID:

Author	ORCID
Jain, Siddharth	0000-0002-9164-6119

Default Usage Policy:

No commercial reproduction, distribution, display or performance rights in this work are provided.

ID Code:

11436

Collection:

CaltechTHESIS

Deposited By:

Siddharth Jain

Deposited On:

30 Apr 2019 17:43

Last Modified:

04 Oct 2019 00:25

Thesis Files

See Usage Policy.

Format	Description	Size
PDF	Final Version	4MB
Plain Text	Supplementary File 1 (LUSC unamplified samples) - Supplemental Material	67kB
Plain Text	Supplementary File 2 (PAAD unamplified samples) - Supplemental Material	25kB
Plain Text	Supplementary File 3 (LUAD unamplified samples) - Supplemental Material	70kB
Plain Text	Supplementary File 4 (HNSC unamplified samples) - Supplemental Material	25kB
Plain Text	Supplementary File 5 (SKCM unamplified samples) - Supplemental Material	46kB
Plain Text	Supplementary File 6 (KIRC unamplified samples) - Supplemental Material	28kB
Plain Text	Supplementary File 7 (GBM unamplified samples) - Supplemental Material	35kB
Plain Text	Supplementary File 8 (PRAD unamplified samples) - Supplemental Material	67kB
Plain Text	Supplementary File 9 (STAD unamplified samples) - Supplemental Material	60kB
Plain Text	Supplementary File 10 (THCA unamplified samples) - Supplemental Material	68kB
Plain Text	Supplementary File 11 (LGG unamplified samples) - Supplemental Material	70kB
Plain Text	Supplementary File 12 (BLCA unamplified samples) - Supplemental Material	56kB
Plain Text	Supplementary File 13 (GBM amplified samples) - Supplemental Material	23kB
Plain Text	Supplementary File 14 (OV amplified samples) - Supplemental Material	29kB
Plain Text	Supplementary File 15 (LAML amplified samples) - Supplemental Material	18kB
MS Excel	Supplementary File 16 (Cancer Genes COSMIC) - Supplemental Material	141kB
