Genome-wide association studies hold great promise for revealing the genetic architectures underlying human complex diseases. Disease variants are often non-Mendelian: each shows low penetrance and has little effect on the disease individually, but the variants interact with each other and with the environment in unknown ways. With recent high-throughput sequencing technology, far more data are being generated at the genome scale, including not only genetic variants but also regulatory elements at the individual level. Regulatory factors are known to interact and to act as mediators between sequence variation and phenotypic diversity. Multi-variant disease mapping is therefore becoming more interesting and more important for future genome-wide association studies. It is also hoped that, by collecting all variants in the human genome, we can identify the true causative variants, so that functional evaluation and validation experiments can be targeted precisely at the identified sites to reveal the biological mechanisms linking them to the disease. Identifying multi-variant associations is extremely challenging, and current algorithms are still very limited. In particular, high-throughput sequencing data are now routinely generated in disease studies. The resulting complete sets of variants are highly dependent, which poses substantial computational difficulties for existing methods and makes it extremely difficult to pinpoint the true disease variants. It is also very challenging to detect disease associations from rare variants, which are nevertheless more abundant in the human genome and could be the main contributors to human complex diseases. We propose to develop advanced algorithms to tackle these problems by improving both the power and the computational efficiency of whole-genome multi-variant mapping. We also propose generalized methods to jointly test common and rare variants under a coherent, fully probabilistic model.
Our approach automatically groups variants for joint testing, accounts for dependence, incorporates biological priors, and identifies causative variants. We will further extend the methods via non-parametric Bayesian techniques to integrate various public databases into disease mapping. Our new algorithms will greatly enhance researchers' capability to analyze high-throughput genetic and genomic data. The software will be freely distributed to the community through the PI's website and the Galaxy system hosted at Penn State.
The goal of the project is to develop powerful and efficient new statistical tools that advance our ability to analyze genome-wide data sets for human complex diseases and that better integrate publicly available knowledge bases into disease association mapping. Tools developed in this project will be freely distributed to the research community to facilitate biological discovery toward understanding the regulatory mechanisms underlying inherited human complex phenotypes.