With hundreds of sequenced genomes available for many species, the challenge now lies in building predictive models for the genotype-to-phenotype map. Millions of polymorphic bases make each of us morphologically, intellectually, and psychologically unique. The approach of associating whole-genome polymorphisms with a myriad of phenotypes (GWAS) has been in fashion. Its reliance on purely statistical associations requires screening many thousands of individuals to pinpoint alleles that typically explain appreciable, though modest, fractions of natural variation. The next step - the long-term goal of this project - is to move from association to causation, where a model of well-understood molecular pathways is modified, individually for each genotype, to reflect the functional effects of its unique set of polymorphisms. We develop the concepts and models necessary to advance this goal using Drosophila, where the molecular tools are precise and quantitative predictions are verifiable. We will develop several levels of predictive models. First, we will predict the functional consequences of SNPs on gene expression from sequence alone, based on knowledge of transcription factor (TF) binding sites and predictive models of how sequence affects DNA shape. These models will be validated with cis-eQTL approaches and directed measurements of expression and TF binding. Second, the composite effects of coding and regulatory polymorphisms will be incorporated into a network-level structural equation model (SEM). We will fit the model with two types of expression data gathered in multiple genotypes, and predict and experimentally verify the functional consequences of unmeasured polymorphisms. Third, the model will be extended to incorporate putative epistatic interactions, estimated using approximate Bayesian computation. This will generalize and 'quantitate' the SEM, and evaluate the sensitivity of downstream phenotypes to molecular perturbations at different tiers.
We will validate these predictions using population genetic data. While conceptually simple, developing this framework requires close collaboration between computational and molecular biologists building refined molecular biological knowledge and tools. A developmental process - early embryo segmentation in Drosophila melanogaster - appears ripe for attack. The network is well characterized, and a wealth of functional data is available on the individual components, including DNA binding preferences and cellular-resolution expression patterns of critical TFs. The requisite experimental techniques are scalable to process many sequenced fly genotypes. Abundant genetic variation in expression, timing, and morphology during embryo development is well documented. Building the first mechanistic model of the embryo genotype-to-phenotype map is our focus, but this work will also have a strong impact on the medical field. Success in developing these integrated approaches will enable optimal choice of targets for therapeutic interventions to restore network function in disease. The concepts and tools we establish will serve as a template for analysis of complex networks relevant to human health.
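
As an illustration of the first modeling level (predicting the effect of a SNP on TF binding from sequence alone), the following is a minimal sketch, not the project's actual method. It scores a binding site against a position frequency matrix and reports the change in log-odds score caused by a single-base substitution. The matrix values, the uniform background, and the names PFM, log_odds, and snp_effect are all hypothetical and chosen only for illustration; DNA shape features are not modeled here.

```python
import math

# Hypothetical position frequency matrix for a 4-bp motif (one dict per position).
# These values are invented for illustration and describe no real TF.
PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

# Assumed uniform background base frequency.
BACKGROUND = 0.25


def log_odds(site):
    """Sum of per-position log2(frequency / background) for a candidate site."""
    if len(site) != len(PFM):
        raise ValueError("site length must match motif length")
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PFM, site))


def snp_effect(ref_site, pos, alt_base):
    """Change in log-odds score when the base at index pos is replaced by alt_base."""
    alt_site = ref_site[:pos] + alt_base + ref_site[pos + 1:]
    return log_odds(alt_site) - log_odds(ref_site)


if __name__ == "__main__":
    # A negative value suggests the alternate allele weakens the predicted site.
    print(snp_effect("AGCT", 1, "T"))
```

A negative score change would flag the SNP as a candidate for reduced binding, which is the kind of sequence-based prediction that cis-eQTL data and direct binding measurements could then test.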

Public Health Relevance

To build predictive genotype-to-phenotype maps, genetic epidemiologists must move from association to causation, where a model of well-understood molecular pathways is modified, individually for each genotype, to reflect the functional effects of its unique set of polymorphisms. There is much scope for refining this approach using the Drosophila model, where the molecular tools are precise and quantitative predictions are verifiable. This project will develop a variety of predictive models to accomplish this aim: annotating functional regulatory polymorphisms, developing linear network analysis, and annotating molecular networks in the context of population variation with approximate Bayesian computation. Cellular and whole-embryo data scales will be merged, and joint models will be built to combine multiple scales of modeling.

National Institutes of Health (NIH)
Research Project--Cooperative Agreements (U01)
Study Section: Special Emphasis Panel (ZEB1)
Program Officer: Lyster, Peter
University of Southern California
Schools of Arts and Sciences
Los Angeles
United States
Chiu, Tsu-Pei; Yang, Lin; Zhou, Tianyin et al. (2015) GBshape: a genome browser database for DNA shape annotations. Nucleic Acids Res 43:D103-9
Dantas Machado, Ana Carolina; Zhou, Tianyin; Rao, Satyanarayan et al. (2015) Evolving insights on how cytosine methylation affects protein-DNA binding. Brief Funct Genomics 14:61-73
Yang, Lin; Zhou, Tianyin; Dror, Iris et al. (2014) TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res 42:D148-55
Slattery, Matthew; Zhou, Tianyin; Yang, Lin et al. (2014) Absence of a simple code: how transcription factors read the genome. Trends Biochem Sci 39:381-99
Barozzi, Iros; Simonatto, Marta; Bonifacio, Silvia et al. (2014) Coregulation of transcription factor binding and nucleosome occupancy through DNA features of mammalian enhancers. Mol Cell 54:844-57
Zhang, Xiaojun; Dantas Machado, Ana Carolina; Ding, Yuan et al. (2014) Conformations of p53 response elements in solution deduced using site-directed spin labeling and Monte Carlo sampling. Nucleic Acids Res 42:2789-97
Dror, Iris; Zhou, Tianyin; Mandel-Gutfreund, Yael et al. (2014) Covariation between homeodomain transcription factors and the shape of their DNA binding sites. Nucleic Acids Res 42:430-41
Nuzhdin, Sergey V; Turner, Thomas L (2013) Promises and limitations of hitchhiking mapping. Curr Opin Genet Dev 23:694-9