Next-generation sequencing has revealed the molecular landscape of cells in unprecedented detail. However, for the large-scale data produced by assays based on these technologies, informativeness is not only a function of wet-lab technology, but is also critically a function of the analytical pipelines that interpret the data. Our group has developed four statistical tools designed to maximize the informativeness of these assays: 1) the Genome Structural Correction (GSC), a nonparametric model of genomic annotations used to assess the significance of relationships between features; 2) the Irreproducible Discovery Rate (IDR), an analogue of the FDR that leverages information from biological replicates; 3) Statmap, a comprehensive analysis pipeline for ChIP-seq and CAGE data that propagates statistical confidence from base-calling to peak-calling; and 4) Sparse Linear Isoform Discovery and abundance Estimation (SLIDE), an integrative statistical framework for the analysis of RNA-seq, cDNA, and other RNA data aimed at obtaining and quantifying de novo transcript models. These tools are designed to identify and characterize functional elements in genomes; they make minimal assumptions about the data they analyze, and therefore yield reliable conclusions and measures of statistical confidence. During the K99, we will expand and integrate our tools to extend the reach of statistical confidence throughout data interpretation. During the R00, my research will progress toward the inference and assessment of biological networks. Just as ortholog identification has become an essential step in developing animal models of human disease, multi-species network analysis promises to become a key step in interpreting the relationship between genome variation and phenotype. Many mutations, even gene deletions, do not reveal an obvious phenotype. This is due to network robustness, which often differs between closely related species.
To understand these phenomena, we aim to: 1) develop standard statistical tools for network inference, and 2) develop meta-models of networks that will permit general measures of network orthology. These two aims are tightly linked: to model biological networks, we must first characterize their semantics. Currently, some models lack consistent definitions of edges and weights, resulting in untestable representations of genomics data. We will develop testable, quantitative models of biological processes, establishing a uniform semantics that leverages the rich theory of complex systems. Each of the tools above will play a key role, especially Statmap and the GSC, which will be needed to propagate statistical confidence into network analysis. Advances in this area will have a transformative effect on our ability to map animal models of disease onto human biology. Nearly nine out of ten new drugs fail in human trials due to issues (e.g. toxicity) not present in animal models. Understanding the orthology not just of individual genes, but of entire biochemical networks, will be essential to infer and correct for differences between models of disease and human biology. Solving this problem will be a major step forward in the march from "base-pairs to bedside".
This proposal outlines training and mentoring plans that emphasize modern nonparametric statistical theory, developmental biology, and hands-on wet-lab techniques. The goal is to produce an independent investigator who functions as a nexus of communication between data producers and data analysts, and who is able to recognize and solve otherwise 'orphan' problems: important biological questions that require advances in statistical theory to be answered well. The statistical tools that the candidate will generate during the award will lead to testable, quantitative models of biological processes, with the ultimate goal of establishing a uniform semantics for biological network analysis that leverages the increasingly rich theory of complex systems.
Baillie, J Kenneth; Bretherick, Andrew; Haley, Christopher S et al. (2018) Shared activity patterns arising at genetic susceptibility loci reveal underlying genomic and cellular architecture of human disease. PLoS Comput Biol 14:e1005934
Kvist, Jouni; Gonçalves Athanàsio, Camila; Shams Solari, Omid et al. (2018) Pattern of DNA Methylation in Daphnia: Evolutionary Perspective. Genome Biol Evol 10:1988-2007
Basu, Sumanta; Kumbier, Karl; Brown, James B et al. (2018) Iterative random forests to discover predictive and stable high-order interactions. Proc Natl Acad Sci U S A 115:1943-1948
Orsini, Luisa; Brown, James B; Shams Solari, Omid et al. (2018) Early transcriptional response pathways in Daphnia magna are coordinated in networks of crustacean-specific genes. Mol Ecol 27:886-897
Parra, Marilyn; Booth, Ben W; Weiszmann, Richard et al. (2018) An important class of intron retention events in human erythroblasts is regulated by cryptic exons proposed to function as splicing decoys. RNA 24:1255-1265
Miyano, Masaru; Sayaman, Rosalyn W; Stoiber, Marcus H et al. (2017) Age-related gene expression in luminal epithelial cells is driven by a microenvironment made from myoepithelial cells. Aging (Albany NY) 9:2026-2051
Orsini, Luisa; Gilbert, Donald; Podicheti, Ram et al. (2017) Daphnia magna transcriptome by RNA-Seq across 12 environmental stressors. Sci Data 4:170006
Zhang, Weiguo; Mao, Jian-Hua; Zhu, Wei et al. (2016) Centromere and kinetochore gene misexpression predicts cancer patient survival and response to radiotherapy and chemotherapy. Nat Commun 7:12619
Stoiber, Marcus; Celniker, Susan; Cherbas, Lucy et al. (2016) Diverse Hormone Response Networks in 41 Independent Drosophila Cell Lines. G3 (Bethesda) 6:683-94
Mao, Jian-Hua; Langley, Sasha A; Huang, Yurong et al. (2015) Identification of genetic factors that modify motor performance and body weight using Collaborative Cross mice. Sci Rep 5:16247
Showing the most recent 10 out of 12 publications