Next-generation sequencing has revealed the molecular landscape of cells in unprecedented detail. However, for the massive data sets produced by assays based on these technologies, informativeness is not only a function of wet-lab technology, but is critically also a function of the analytical pipelines that interpret the data. Our group has developed four statistical tools designed to maximize the informativeness of these assays: 1) the Genome Structural Correction (GSC), a nonparametric model of genomic annotations used to assess the significance of relationships between features; 2) the Irreproducible Discovery Rate (IDR), an analogue of the FDR that leverages information from biological replicates; 3) Statmap, a comprehensive analysis pipeline for ChIP-seq and CAGE data that propagates statistical confidence from base-calling to peak-calling; and 4) Sparse Linear Isoform Discovery and abundance Estimation (SLIDE), an integrative statistical framework for the analysis of RNA-seq, cDNA, and other RNA data aimed at obtaining and quantifying de novo transcript models. These tools are designed to identify and characterize functional elements in genomes; they make minimal assumptions about the data they analyze, and therefore yield reliable conclusions and measures of statistical confidence. During the K99, we will expand and integrate our tools to extend the reach of statistical confidence throughout data interpretation. During the R00, my research will progress toward the inference and assessment of biological networks. Just as ortholog identification has become an essential step in developing animal models of human disease, multi-species network analysis promises to become a key step in interpreting the relationship between genome variation and phenotype. Many mutations, even gene deletions, do not reveal an obvious phenotype. This is due to network robustness, which often differs between closely related species.
To understand these phenomena, we aim to: 1) develop standard statistical tools for network inference, and 2) develop "meta models" of networks that will permit general measures of network orthology. These two aims are tightly linked: to model biological networks, we must first characterize their semantics. Currently, some models lack consistent definitions of edges and weights, resulting in untestable representations of genomics data. We will develop testable, quantitative models of biological processes, establishing a uniform semantics that leverages the rich theory of complex systems. Each of the tools above will play a key role, especially Statmap and the GSC, which will be needed to propagate statistical confidence into network analysis. These advances will have a transformative effect on our ability to map animal models of disease onto human biology. Nearly nine out of ten new drugs fail in human trials due to issues (e.g., toxicity) not present in animal models. Understanding the orthology not just of individual genes, but of entire biochemical networks, will be essential to infer and correct for differences between models of disease and human biology. Solving this problem will be a major step forward in the march from "base-pairs to bedside".
This proposal outlines training and mentoring plans that emphasize modern nonparametric statistical theory, developmental biology, and hands-on wet-lab techniques. The goal is to produce an independent investigator who functions as a nexus of communication between data producers and data analysts, and who is able to recognize and solve otherwise "orphan" problems: important biological questions that require advances in statistical theory to be answered well. The statistical tools that the candidate will generate during the award will lead to testable, quantitative models of biological processes, with the ultimate goal of establishing a uniform semantics for biological network analysis that leverages the increasingly rich theory of complex systems.