A fundamental challenge in the life sciences is the characterization of the genetic factors that underlie phenotypic differences. Thanks to advanced sequencing technologies, an enormous number of genetic variants have been identified and cataloged. Such data hold great potential for understanding how genes affect phenotypes and contribute to susceptibility to environmental stimuli. However, existing computational methods for analyzing and interpreting high-throughput genetic data are still in their infancy. The objective of this project is to systematically investigate the computational and statistical principles for modeling and discovering the genetic basis of complex phenotypes. The proposed research addresses the following fundamental questions in genetic association studies: (1) How can the statistical significance of findings be assessed effectively and efficiently? (2) How can relatedness between samples be accounted for? (3) How can possible interactions between multiple genetic factors, and their joint contribution to phenotypic variation, be captured accurately? In particular, the team will develop a multi-layer indexing structure for robust and scalable multiple testing correction, a general phylogenetic-tree-based framework to account for local population structure, and an ensemble learning approach for studying the joint effects of multiple genetic factors.

The research provides a computational framework for large-scale genotype-phenotype association studies. The outcomes include novel methods for addressing sample relatedness, capturing confounding factors, and controlling multiple testing errors. These methods are widely applicable to many common data mining tasks, including frequent pattern mining, multitask learning, and ensemble learning. Collectively, the theoretical framework and algorithms will give the research community better tools for dissecting complex relationships between genotypes and phenotypes and for gaining a deeper understanding of the roles of environmental stimuli.

The proposed research directly involves applications in large-scale genome-wide association studies. Additional applications exist for biologists studying gene-gene interactions, metabolic pathways, and protein-protein interaction networks. Beyond the applications proposed here, the algorithms may find wide application in other areas of biology as well as in other scientific disciplines. The methods will be evaluated thoroughly on both simulated data and real data collected from yeast, mouse, and human. Early versions of the applications will be made available to the biological community through a web-based server, both to evaluate the efficacy of the methods and to apply them to a broader set of problems.

The research findings and methods will be integrated into graduate and undergraduate instruction. The team already offers classes in computational biology and data mining, where the proposed tools will help students comprehend abstract concepts and data relations. The team will also continue its commitment to supporting multidisciplinary educational experiences, serving the research community, and providing research opportunities for undergraduate students.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1162369
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2012-10-01
Budget End: 2012-12-31
Support Year:
Fiscal Year: 2011
Total Cost: $442,303
Indirect Cost:
Name: University of North Carolina Chapel Hill
Department:
Type:
DUNS #:
City: Chapel Hill
State: NC
Country: United States
Zip Code: 27599