The University of Southern California is awarded a grant for the development of integrated approaches to genome-to-phenome mapping. The project will utilize the rapidly accumulating body of genomics data, especially the enormous amount of public microarray data, together with the associated phenotypic and environmental context information, to reconstruct the biological basis of phenotypes. Traditional association studies have been relatively successful at relating genetic polymorphisms to phenotypes. However, they have met difficulties in elucidating the gene-gene interactions that contribute to complex phenotypes. Just as the genome and proteome signify all of an organism's genes and proteins, the phenome represents the entirety of its phenotypic traits. The objective of the project is to develop novel methods aimed at deriving genome-wide molecular networks of genotype-phenotype associations, termed "phenomic associations." The project will carry out four specific activities: (1) genome-wide mapping of phenotype-specific network modules; (2) systematic reconstruction of phenotype-specific transcriptional regulatory networks; (3) phenotype prediction and computational diagnosis utilizing public genomics databases, especially the large public microarray repositories, to create an automated disease diagnosis database; and (4) development of a software/database system that will integrate genomics and phenomics analysis. The proposed research is coupled with multidisciplinary training of graduate and undergraduate students in computational genomics. A forum will be created for high school biology teachers, and career discussion events will be hosted for high school students. Community outreach activities will be supported by the USC Neighborhood Academic Initiative.

Project Report

During the project period, we developed a series of computational and statistical methods to map genome to phenome. We focused on three areas:

(1) Network-based approaches to map the transcriptome to the phenome. Although many studies have successfully discovered cooperating groups of genes, mapping these groups to phenotypes has proved a much more challenging task. We developed several approaches to perform genome-wide mapping of gene coexpression modules onto the phenome. These approaches are efficient and scalable, and apply to both unweighted and weighted networks. Although we used coexpression networks as the testing system, our methods are generally applicable to any kind of abundant network data with defined phenotype associations, and thus pave the way for genome-wide gene network-phenotype maps.

(2) Bayesian approach to transforming public gene expression repositories into disease diagnosis databases. The rapid accumulation of gene expression data has offered unprecedented opportunities to study human diseases. The National Center for Biotechnology Information Gene Expression Omnibus is one of the largest databases that systematically documents the genome-wide molecular basis of diseases. However, this resource has thus far been far from fully utilized. We described the first study to transform public gene expression repositories into an automated disease diagnosis database. In particular, we developed a systematic framework, including a two-stage Bayesian learning approach, to diagnose one or multiple diseases for a query expression profile along a hierarchical disease taxonomy. Cross validation showed a high level of overall diagnostic accuracy. We also demonstrated that the power of our method can increase significantly with the continued growth of public gene expression repositories, and showed how our disease diagnosis system can be used to characterize complex phenotypes and to construct a disease-drug connectivity map.

(3) Matrix decomposition methods to analyze multidimensional genomics data. The rapid development of high-throughput technology has made it possible to perform high-resolution genome profiling on several platforms simultaneously (e.g., DNA methylation, gene expression, and copy number variation), resulting in an abundance of multidimensional genomic data. Such data provide unique and unprecedented opportunities to explore the coordination and cooperation between regulatory mechanisms on multiple levels. To exploit these data, we developed a joint matrix factorization method, a network-regularized joint matrix factorization method, and a partial least squares regression method adapted to sparse block matrices. These methods integrate multiple datasets in an unsupervised, semi-supervised, or supervised manner, respectively. We showed that these methods can reveal phenotypically relevant patterns that would have been overlooked with only a single type of data, and can uncover new associations between the different layers of cellular activity.
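
To make the idea of joint matrix factorization concrete, the following is a minimal sketch, not the project's actual implementation. The report does not specify the algorithm, so the sketch assumes a non-negative formulation in which several data matrices measured on the same samples share one sample factor W, fitted with standard multiplicative updates. The function name joint_nmf and all of its parameters are hypothetical.

```python
import numpy as np

def joint_nmf(matrices, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor each non-negative X_i (samples x features_i) as W @ H_i.

    Assumption: all matrices share the same rows (samples), so a single
    basis W is shared while each data type keeps its own coefficients H_i.
    """
    rng = np.random.default_rng(seed)
    n_samples = matrices[0].shape[0]
    W = rng.random((n_samples, rank))
    Hs = [rng.random((rank, X.shape[1])) for X in matrices]
    errors = []
    for _ in range(n_iter):
        # Update the shared factor using evidence from every data type.
        numer = sum(X @ H.T for X, H in zip(matrices, Hs))
        denom = W @ sum(H @ H.T for H in Hs) + eps
        W *= numer / denom
        # Update each data-type-specific coefficient matrix.
        for i, X in enumerate(matrices):
            Hs[i] *= (W.T @ X) / (W.T @ W @ Hs[i] + eps)
        # Track the total squared reconstruction error across data types.
        errors.append(sum(np.linalg.norm(X - W @ H) ** 2
                          for X, H in zip(matrices, Hs)))
    return W, Hs, errors
```

In this sketch, each column of W would group samples, and the large entries in the matching row of each H_i would pick out features (for example, genes or methylation sites) that vary together across data types. A network-regularized variant would add penalty terms derived from known interactions, which are omitted here.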

National Science Foundation (NSF)
Division of Biological Infrastructure (DBI)
Program Officer
Peter H. McCartney
University of Southern California
Los Angeles
United States