This research aims to develop a comprehensive set of statistical methods applicable to the high-dimensional, sparse datasets currently generated by high-throughput genomic experiments. The PI proposes to introduce a novel unified statistical framework for handling large-scale genetic data, and to study its theoretical properties. Within this unified framework, a number of relevant problems can be addressed. The PI will investigate a series of association testing strategies for rare genetic variants that complement available methods for case-control designs and generalize them to designs involving individuals related in an arbitrary fashion. Moreover, the PI will develop an analytic theory to identify the most powerful statistical designs for association studies with rare variants. The proposed methods will be tested on a broad range of simulated data and on real data from the PI's collaborators.
Recent progress in genomic technologies has led to the generation of large amounts of genetic data. The emergence of such large-scale, sparse genetic data poses great statistical challenges that require novel and powerful approaches to efficiently extract the information contained in the data. Although theoretical in nature, the statistical methods proposed here have the potential to contribute directly to the understanding of the genetic mechanisms underlying complex human traits. To maximize their impact, the proposed methods will be implemented in a software package that will be made available to the larger scientific community. Beyond its scientific importance, the project has the potential to contribute to the higher goal of improving public health. The project also has a strong educational component, and will provide valuable research experience for students and postdoctoral fellows.