Genome-wide association studies hold great promise for revealing the genetic architecture underlying complex human diseases. Disease variants are often non-Mendelian: individually they show low penetrance and small effects on disease, but they interact with each other and with the environment in unknown ways. With recent high-throughput sequencing technology, far more genome-scale data are being generated at the individual level, including not only genetic variants but also regulatory elements. Regulatory factors are known to interact and to act as mediators between sequence variation and phenotypic diversity. Multi-variant disease mapping is therefore becoming more interesting and more important for future genome-wide association studies. It is also hoped that, by collecting all variants in the human genome, we can identify the true causative variants, so that functional evaluation and validation experiments can be targeted precisely at the identified sites to reveal their biological roles in disease. Identifying multi-variant associations is extremely challenging, and current algorithms are still very limited. In particular, high-throughput sequencing data are now routinely generated in disease studies. The resulting complete sets of variants are highly dependent, which creates substantial computational difficulties for existing methods and makes it extremely difficult to pinpoint the true disease variants. It is also very challenging to detect disease associations from rare variants, which are nevertheless more abundant in the human genome and could be the main contributors to complex human diseases. We propose to develop advanced algorithms to tackle these problems by improving the power and computational efficiency of whole-genome multi-variant mapping. We also propose generalized methods to jointly test common and rare variants under a coherent, fully probabilistic model. Our approach automatically groups variants for joint testing, accounts for dependence, incorporates biological priors, and identifies causative variants. We further extend the methods via non-parametric Bayesian techniques to integrate various public databases into disease mapping. The new algorithms will greatly enhance researchers' capability to analyze high-throughput genetic and genomic data. The software will be freely distributed to the community through the PI's website and the Galaxy system hosted at Penn State.

Public Health Relevance

The goal of the project is to develop powerful and efficient new statistical tools that advance our ability to analyze genome-wide data sets for complex human diseases and to better integrate publicly available knowledge bases into disease association mapping. Tools developed in this project will be freely distributed to the research community to facilitate biological discovery and improve understanding of the regulatory mechanisms underlying inherited complex human phenotypes.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Brooks, Lisa
Pennsylvania State University
Biostatistics & Other Math Sci
Schools of Arts and Sciences
University Park
United States
Zhang, Yu; Tian, Lifeng; Sleiman, Patrick et al. (2017) Bayesian analysis of genome-wide inflammatory bowel disease data sets reveals new risk loci. Eur J Hum Genet :
Zhang, Yu; An, Lin; Yue, Feng et al. (2016) Jointly characterizing epigenetic dynamics across multiple human cell types. Nucleic Acids Res 44:6721-31
Lee, Yeonok; Ghosh, Debashis; Zhang, Yu (2014) Regression hidden Markov modeling reveals heterogeneous gene expression regulation: a case study in mouse embryonic stem cells. BMC Genomics 15:360
Chen, Kuan-Bei; Hardison, Ross; Zhang, Yu (2014) dCaP: detecting differential binding events in multiple conditions and proteins. BMC Genomics 15 Suppl 9:S12
Lee, Yeonok; Ghosh, Debashis; Hardison, Ross C et al. (2014) MRHMMs: multivariate regression hidden Markov models and the variantS. Bioinformatics 30:1755-6
Zhang, Yu; Ghosh, Soumitra; Hakonarson, Hakon (2014) Dynamic Bayesian testing of sets of variants in complex diseases. Genetics 198:867-78
Zhang, Yu (2013) De novo inference of stratification and local admixture in sequencing studies. BMC Bioinformatics 14 Suppl 5:S17
Lee, Yeonok; Ghosh, Debashis; Zhang, Yu (2013) Association testing to detect gene-gene interactions on sex chromosomes in trio data. Front Genet 4:239
Zhang, Yu (2013) A dynamic Bayesian Markov model for phasing and characterizing haplotypes in next-generation sequencing. Bioinformatics 29:878-85
Zhang, Yu (2012) A novel bayesian graphical model for genome-wide multi-SNP association mapping. Genet Epidemiol 36:36-47

Showing the most recent 10 out of 18 publications