Genome-wide association studies hold great promise for revealing the genetic architecture underlying human complex diseases. Disease variants are often non-Mendelian: each shows low penetrance and has little effect on the disease individually, but interacts with other variants and with the environment in unknown ways. Recent high-throughput sequencing technology generates far more genome-scale data, including not only genetic variants but also regulatory elements at the individual level. Regulatory factors are known to interact and to act as mediators between sequence variation and phenotypic diversity. Multi-variant disease mapping is therefore becoming more interesting and important for future genome-wide association studies. It is also hoped that, by collecting all variants in the human genome, we can identify the true causative variants, so that functional evaluation and validation experiments can be targeted precisely at the identified sites to reveal the biological mechanisms linking them to the disease. Identifying multi-variant associations is extremely challenging, and current algorithms are still very limited. In particular, high-throughput sequencing data are now routinely generated in disease studies. The complete sets of variants they produce are highly dependent; existing methods have substantial computational difficulties with such data, which makes it extremely difficult to pinpoint the true disease variants. It is also very challenging to detect disease associations from rare variants, which are nevertheless more abundant in the human genome and could be the main contributors to human complex diseases. We propose to develop advanced algorithms to tackle these problems, improving the power and computational efficiency of whole-genome multi-variant mapping. We also propose generalized methods to jointly test common and rare variants under a coherent, fully probabilistic model.
Our approach automatically groups variants for joint testing, accounts for dependence, incorporates biological priors, and identifies causative variants. We further extend the methods via non-parametric Bayesian techniques to integrate various public databases into disease mapping. The new algorithms will greatly enhance researchers' capability to analyze high-throughput genetic and genomic data. The software will be freely distributed to the community through the PI's website and the Galaxy system hosted at Penn State.
The goal of the project is to develop powerful and efficient new statistical tools that advance our capability to analyze genome-wide data sets for human complex diseases, and to better integrate publicly available knowledge bases into disease association mapping. Tools developed in this project will be freely distributed to the research community to facilitate biological discovery and improve understanding of the regulatory mechanisms underlying inherited human complex phenotypes.
|Zhang, Yu; Ghosh, Soumitra; Hakonarson, Hakon (2014) Dynamic Bayesian testing of sets of variants in complex diseases. Genetics 198:867-78|
|Lee, Yeonok; Ghosh, Debashis; Zhang, Yu (2014) Regression hidden Markov modeling reveals heterogeneous gene expression regulation: a case study in mouse embryonic stem cells. BMC Genomics 15:360|
|Zhang, Yu (2013) De novo inference of stratification and local admixture in sequencing studies. BMC Bioinformatics 14 Suppl 5:S17|
|Zhang, Yu (2013) A dynamic Bayesian Markov model for phasing and characterizing haplotypes in next-generation sequencing. Bioinformatics 29:878-85|
|Wu, Weisheng; Cheng, Yong; Keller, Cheryl A et al. (2011) Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration. Genome Res 21:1659-71|
|Zhang, Yu; Jiang, Bo; Zhu, Jun et al. (2011) Bayesian models for detecting epistatic interactions from genetic data. Ann Hum Genet 75:183-93|
|Zhang, Yu; Zhang, Jing; Liu, Jun S (2011) Block-based Bayesian epistasis association mapping with application to WTCCC type 1 diabetes data. Ann Appl Stat 5:2052-2077|
|Zhang, Yu (2011) Bayesian epistasis association mapping via SNP imputation. Biostatistics 12:211-22|
|Zhang, Yu; Liu, Jun S (2011) Fast and Accurate Approximation to Significance Tests in Genome-Wide Association Studies. J Am Stat Assoc 106:846-857|
|Cheng, Yong; Wu, Weisheng; Kumar, Swathi Ashok et al. (2009) Erythroid GATA1 function revealed by genome-wide analysis of transcription factor occupancy, histone modifications, and mRNA expression. Genome Res 19:2172-84|