INTELLECTUAL MERIT

OVERVIEW AND OBJECTIVES

Every protein progresses through a natural lifecycle from birth to maturation to death; this process is coordinated by the protein homeostasis system. Dysregulation of the protein homeostasis system leads to imbalances in the proteome, which can cause neurodegenerative pathologies such as Alzheimer's (Morawe et al., 2012), Huntington's (Shirasaki et al., 2012), or Parkinson's disease (Cook et al., 2012). Recent technological advances in massively parallel mutagenesis and deep DNA sequencing now make it possible to elucidate which genetic networks in complex cellular systems, such as protein homeostasis, are essential to cellular function, and under what conditions (e.g., carbon source, temperature, pH) they are essential. However, identifying such conditionally essential networks (CENs) has been challenging for computational and statistical reasons. The goal of this project is to elucidate and validate CENs in the protein homeostasis system by developing computationally efficient and statistically accurate methods for analyzing deep DNA sequencing data from massively parallel mutagenesis experiments. These CENs can then be used to identify biomarker combinations for diagnosis and treatment in humans or to identify regulatory networks in model organisms. The tools developed for this purpose will be generally useful for analyzing deep DNA sequencing data from massively parallel mutagenesis experiments on other cellular networks and in other organisms.

The long-term goal of this research is to develop modern nonparametric Bayesian (NPB) modeling methods for large-scale genomic experiments. This proposal focuses on developing an NPB model for DNA sequencing data from massively parallel mutagenesis experiments and using that model to uncover the latent genetic architecture of the protein homeostasis system.
This model can rigorously integrate sequencing data from samples of varying purity, from single cells to heterogeneous mixtures, to reveal conditionally essential networks.
Our aims are supported by our previously published work on parallel mutagenesis, genomic data analysis, and protein homeostasis, as well as by our preliminary data showing that our Bayesian model is able to learn relevant hierarchical mixture components in simulated data. The proposed rigorous statistical model will allow us to gain new insights into the protein homeostasis system, and it will benefit any researcher who seeks to interpret sequencing data from large-scale mutagenesis studies. We will achieve our objective by completing the following specific aims:

1. Develop and validate a nonparametric Bayesian model for identifying CENs from parallel mutagenesis deep sequencing experiments.
2. Identify and validate protein homeostasis CENs using transposon sequencing experiments.

[Figure 1 schematic; panel labels: Data (samples), Model, Conditionally Essential Networks]
Figure 1: Overview of the aims of this proposal. The nonparametric Bayesian model (Aim 1) is used to elucidate conditionally essential networks (shown as blue, red, and green edges) from deep sequencing of mutant pools grown under many environmental conditions (Aim 2).

This project will create new statistical methods, models, and software for analyzing DNA sequencing data from massively parallel mutagenesis experiments to discover latent conditionally essential networks. Transposon mutagenesis and sequencing (Tn-seq) will reveal conditionally essential gene networks: groups of genes that are essential in subsets of experimental conditions.