Genome-wide association studies hold great promise for revealing the genetic architectures underlying complex human diseases. Disease variants are often non-Mendelian: each shows low penetrance and has little effect on the disease individually, yet they interact with one another and with the environment in unknown ways. Recent high-throughput sequencing technology generates far more genome-scale data, including not only genetic variants but also regulatory elements at the individual level. Regulatory factors are known to interact and to act as mediators between sequence variation and phenotypic diversity. Multi-variant disease mapping is therefore becoming more interesting and important for future genome-wide association studies. It is also hoped that, by collecting all variants in the human genome, we can identify the true causative variants, so that functional evaluation and validation experiments can be targeted precisely at the identified sites to reveal their biological mechanisms in disease. Identifying multi-variant associations is extremely challenging, and current algorithms are still very limited. In particular, high-throughput sequencing data are now routinely generated in disease studies. The resulting complete sets of variants are highly dependent; existing methods face substantial computational difficulties with them, which makes it extremely difficult to pinpoint the true disease variants. It is also very challenging to detect disease associations from rare variants, which are nevertheless more abundant in the human genome and could be the main contributors to complex human diseases. We propose to develop advanced algorithms that tackle these problems by improving the power and computational efficiency of whole-genome multi-variant mapping. We also propose generalized methods to jointly test common and rare variants under a coherent, fully probabilistic model. Our approach automatically groups variants for joint testing, accounts for dependence, incorporates biological priors, and identifies causative variants. We further extend the methods via non-parametric Bayesian techniques to integrate various public databases into disease mapping. The new algorithms will greatly enhance researchers' capability to analyze high-throughput genetic and genomic data. The software will be freely distributed to the community through the PI's website and the Galaxy system hosted at Penn State.

Public Health Relevance

The goal of the project is to develop powerful and efficient new statistical tools that advance our capability to analyze genome-wide data sets for complex human diseases and to better integrate publicly available knowledge bases into disease association mapping. Tools developed in this project will be freely distributed to the research community to facilitate biological discovery toward understanding the regulatory mechanisms underlying inherited complex human phenotypes.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG004718-05
Application #: 8532953
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Brooks, Lisa
Project Start: 2008-08-15
Project End: 2015-06-30
Budget Start: 2013-07-01
Budget End: 2014-06-30
Support Year: 5
Fiscal Year: 2013
Total Cost: $178,195
Indirect Cost: $53,195

Name: Pennsylvania State University
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 003403953
City: University Park
State: PA
Country: United States
Zip Code: 16802

Publications

Zhang, Yu; Ghosh, Soumitra; Hakonarson, Hakon (2014) Dynamic Bayesian testing of sets of variants in complex diseases. Genetics 198:867-78
Lee, Yeonok; Ghosh, Debashis; Zhang, Yu (2014) Regression hidden Markov modeling reveals heterogeneous gene expression regulation: a case study in mouse embryonic stem cells. BMC Genomics 15:360
Zhang, Yu (2013) De novo inference of stratification and local admixture in sequencing studies. BMC Bioinformatics 14 Suppl 5:S17
Zhang, Yu (2013) A dynamic Bayesian Markov model for phasing and characterizing haplotypes in next-generation sequencing. Bioinformatics 29:878-85
Wu, Weisheng; Cheng, Yong; Keller, Cheryl A et al. (2011) Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration. Genome Res 21:1659-71
Zhang, Yu; Jiang, Bo; Zhu, Jun et al. (2011) Bayesian models for detecting epistatic interactions from genetic data. Ann Hum Genet 75:183-93
Zhang, Yu; Zhang, Jing; Liu, Jun S (2011) Block-based Bayesian epistasis association mapping with application to WTCCC type 1 diabetes data. Ann Appl Stat 5:2052-2077
Zhang, Yu (2011) Bayesian epistasis association mapping via SNP imputation. Biostatistics 12:211-22
Zhang, Yu; Liu, Jun S (2011) Fast and Accurate Approximation to Significance Tests in Genome-Wide Association Studies. J Am Stat Assoc 106:846-857
Cheng, Yong; Wu, Weisheng; Kumar, Swathi Ashok et al. (2009) Erythroid GATA1 function revealed by genome-wide analysis of transcription factor occupancy, histone modifications, and mRNA expression. Genome Res 19:2172-84