The genetics and genomics communities are advancing rapidly in the Next-Generation Sequencing (NGS) era. The identification of both common and rare genetic variants from large-cohort studies and Mendelian studies provides new opportunities to elucidate disease etiologies and underlying molecular mechanisms. This will ultimately lead to novel and personalized diagnostics, prognostics, and therapeutic treatments. However, significant analytical challenges remain: (1) the discovery and haplotype phasing of rare variants remain difficult; (2) data analysis is fragmented when multiple datasets [SNP arrays, whole-exome sequencing (WES), and/or low-coverage whole-genome sequencing (WGS)] are available; and (3) bioinformatics methods and software are difficult for average users to use, because there is no unified bioinformatics framework and many different tool sets are needed for an end-to-end process. Advanced computational and statistical methods and user-friendly software are urgently needed to meet the demands of the community. The overall goal of this application is to develop an integrative and novel analytical framework that can significantly improve the sensitivity and accuracy of rare variant discovery and haplotype phasing and harmonize multiple datasets in genomics studies. To this end, the following specific aims will be pursued: (1) Develop a framework for improving rare variant discovery and haplotype phasing using read information. (2) Develop a framework for integrating multiple genetic variation datasets. (3) Validate genotyping and phasing of rare variants for pipeline optimization and cross-evaluation between different methods using simulated and experimental data. (4) Develop software packages with Cloud deployment for the community. The approaches are innovative because they utilize novel concepts and methods to improve the accuracy of genotype calling and haplotype phasing from NGS data and to integrate multiple types of genotype data.
Successful accomplishment of our proposed aims will dramatically improve the sensitivity and accuracy of rare variant discovery and phasing, expediting the understanding of the genetic architecture of human diseases.
Next-generation sequencing technologies hold great promise for identifying causal genetic variants for human diseases but also pose daunting challenges for analytical and bioinformatics development. In this application, we will develop comprehensive statistical methods to improve the accuracy of genotype calling and phasing of rare variants, develop a comprehensive framework for integrating multiple types of genotype data and sequencing data, and deploy Cloud-based software tools as a cyberinfrastructure to serve the community. The proposed research is relevant to public health and the mission of NIH because the accomplishment of our proposed work is expected to facilitate the identification of genetic variants underlying human diseases and to help us understand, prevent, diagnose, and treat these diseases.