Carnegie-Mellon University is awarded a grant by the NSF Faculty Early Career Development (CAREER) Program for a promising young researcher to address several challenging computational and theoretical problems in regulatory evolution in metazoan species, combined with an education agenda of integrating evolutionary genomics into the computational biology (CompBio) curriculum at Carnegie Mellon and the University of Pittsburgh. The research component will develop novel computational methods and theoretical models to study the evolutionary mechanisms and processes that shape transcriptional regulatory networks in metazoan species. A number of technical challenges, including mapping the regulatory elements, especially the structurally complex cis-regulatory modules (CRMs), will have to be tackled to develop appropriate models that capture the structural and functional evolution of these elements. The proposed research will focus on the following specific aims to address these challenges: Aim 1: Develop new methods for deciphering the cis-regulatory codes and the transcriptional regulatory networks in metazoan species, and apply them to map potentially all regulatory elements in the fruit fly. Aim 2: Develop new theories and algorithms for modeling regulatory evolution based on structural and functional transformations of regulatory elements, and use them, together with experimental means, to investigate context-dependent cis-regulatory evolution in the fly. Aim 3: Develop algorithms for comparative genomic and network inference based on the new formalism of structural/functional phylogeny (i.e., from Aim 2). The main methodological novelties are: (1) a structure- and syntax-based CRM search algorithm; (2) a dynamic Bayesian network for inferring regulatory networks; (3) context-dependent stochastic models for higher-order structural/functional evolution; and (4) a phylo-genomic CRM finder and network structure predictor.
The educational component focuses on developing a new, better-balanced, and deeper computational biology curriculum for the Joint CMU/Pitt Ph.D. program that provides both wider coverage of fundamental mathematics and computer science principles and working knowledge of a substantially expanded span of biological and biomedical fields. Other education plans include mentoring students and coordinating and contributing to curriculum-building efforts. Understanding genetic variation and its evolution helps address many human health issues, such as the detection of deleterious genetic predispositions and the prediction of the behavior of fast-evolving biological systems such as HIV and immune systems. In addition to their relevance to biology and medicine, the methodological advances in computing and statistical modeling can be easily translated into powerful and generic data-mining tools applicable to complex data beyond biology.

Project Report

Recent developments in genomics, computational and molecular biology, and population genetics have led to a convergence of interest in Regulatory Evolution: the evolutionary and dynamic aspects of gene expression regulation. In a multicellular organism, many important biological processes, such as stem cell differentiation, cancer development, and organism evolution, depend fundamentally on the spatial and temporal control of gene expression. To date, the molecular basis and evolutionary mechanisms underlying these processes remain largely unknown. Addressing these problems necessitates a comprehensive exploration of the combinatorial space of both cis- and trans-elements of the gene regulatory network, evolutionary events taking place at various levels, and the dynamic changes of network topology and functions over time and space. Supported by this CAREER award, the laboratory of statistical artificial intelligence and integrative genomics (SAILING Lab) at Carnegie Mellon University, led by PI Professor Eric Xing, has carried out a comprehensive and in-depth investigation of the problems listed above, via mathematically well-founded new machine learning models and algorithms and computational analysis of datasets from a wide spectrum of sources, such as genomic sequences, mRNA abundance profiles of gene expression, and immunohistochemical staining (IHS) of gene expression in cells and tissues. These data come from multiple organisms, including yeast, fruit fly, human stem cells, and human breast cancer cells, and span various scenarios, including evolution, development, differentiation, tumorigenesis, and cell cycles.
This research has led to a significant array of scientific, educational, and utility outcomes: 1) It has produced a large body of scientific results in the form of: new findings, such as a newly discovered genome evolution process, evolving gene regulatory elements, and mechanisms and patterns of gene network evolution; new mathematical models and algorithms that made such findings possible, such as Dirichlet process based models/algorithms for haplotype inference and genome evolution modeling, coupled HMM models for regulatory element evolution, TESLA/KELLER/TREEGL algorithms for reverse engineering evolving networks, and SPEX/NPMUSSEL algorithms for inferring networks from ISH images; and new computer software available to the public for analyzing the complex biological data described above and visualizing the results. These results have led to about 80 publications in top peer-reviewed journals in computational biology (e.g., PLoS Computational Biology, Bioinformatics, Genetics), machine learning (e.g., Journal of Machine Learning Research), and statistics (e.g., Annals of Applied Statistics), and in peer-reviewed conferences such as ISMB, RECOMB, NIPS, and ICML. 2) It has laid the foundation of a vibrant research lab, the SAILING Lab at CMU, which now hosts about 15 to 20 Ph.D. students and many postdocs at any given time. In particular, this grant has either completely or partially sponsored the work of about 6 Ph.D. students and 3 postdocs overall, who after finishing their training at CMU have become either faculty at leading institutions worldwide, including University of Chicago, Georgia Institute of Technology, Arizona State University, University of Texas at Dallas, and Tsinghua University, or research staff at major IT companies such as Google and Facebook. 3) It has also resulted in a large collection of open source software available to the research community.
The results have also been disseminated through tutorials and keynote lectures at various research workshops, conferences, and university colloquia. In parallel with the research plan, we also proposed an education agenda covering both the integration of evolutionary genomics into the CMU computational biology curriculum and the broader issues of strengthening and balancing Bio/CS/Math training in the new joint CompBio Ph.D. program at Carnegie Mellon and the University of Pittsburgh. So far, this goal has been met. The methodological advancements resulting from this grant have been well integrated into three courses taught at CMU/Pitt: Computational Genomics, Graduate Machine Learning, and Graphical Models. Over the years these courses have been attended by hundreds of graduate students and have had a far-reaching influence on the students’ research and on later teachers of these courses. All of these courses, originally developed by PI Eric Xing at CMU, have now become required courses in CMU graduate programs. In summary, with funding from the NSF CAREER Award, we have achieved the original goals set out in our proposal and have made satisfactory contributions in scientific discovery, methodology development, tool production, and education outreach. We would like to thank NSF for the strong support throughout the duration of this project.

Agency
National Science Foundation (NSF)
Institute
Division of Biological Infrastructure (DBI)
Application #
0546594
Program Officer
Anne Haake
Project Start
Project End
Budget Start
2006-03-01
Budget End
2014-02-28
Support Year
Fiscal Year
2005
Total Cost
$1,312,321
Indirect Cost
Name
Carnegie-Mellon University
Department
Type
DUNS #
City
Pittsburgh
State
PA
Country
United States
Zip Code
15213