The genetics and genomics communities are advancing rapidly in the Next-Generation Sequencing (NGS) era. The identification of both common and rare genetic variants from large-cohort studies and Mendelian studies provides new opportunities to elucidate disease etiologies and underlying molecular mechanisms. This will ultimately lead to novel and personalized diagnostics, prognostics, and therapeutic treatments. However, significant analytical challenges remain: (1) the discovery and haplotype phasing of rare variants remain difficult; (2) data analysis is fragmented when multiple datasets [SNP arrays, whole-exome sequencing (WES), and/or low-coverage whole-genome sequencing (WGS)] are available; and (3) bioinformatics methods and software are difficult for typical users: there is no unified bioinformatics framework, and many different tool sets are needed for an end-to-end analysis. Advanced computational and statistical methods and user-friendly software are urgently needed to meet the needs of the community. The overall goal of this application is to develop a novel integrative analytical framework that can significantly improve the sensitivity and accuracy of rare variant discovery and haplotype phasing and harmonize multiple datasets in genomics studies. To do so, the following specific aims will be pursued: (1) Develop a framework for improving rare variant discovery and haplotype phasing using read information. (2) Develop a framework for integrating multiple genetic variation datasets. (3) Validate genotyping and phasing of rare variants, using simulated and experimental data, for pipeline optimization and cross-evaluation between different methods. (4) Develop software packages with cloud deployment for the community. The approaches are innovative because they utilize novel concepts and methods to improve the accuracy of genotype calling and haplotype phasing from NGS data and to integrate multiple types of genotype data. Successful accomplishment of the proposed aims will dramatically improve the sensitivity and accuracy of rare variant discovery and phasing, expediting our understanding of the genetic architecture of human diseases.

Public Health Relevance

Next-generation sequencing technologies hold great promise for identifying causal genetic variants for human diseases but also pose daunting challenges for analytical and bioinformatics development. In this application, we will develop comprehensive statistical methods to improve the accuracy of genotype calling and phasing of rare variants, develop a comprehensive framework for integrating multiple types of genotype and sequencing data, and deploy cloud-based software tools as a cyberinfrastructure to serve the community. The proposed research is relevant to public health and the mission of the NIH because the proposed work is expected to facilitate the identification of genetic variants underlying human diseases and to help us understand, prevent, diagnose, and treat these diseases.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1-GGG-R (02))
Program Officer: Brooks, Lisa
Baylor College of Medicine
Schools of Medicine
United States
Yang, Xinyu; Li, Jiani; Fang, Yabo et al. (2018) Rho Guanine Nucleotide Exchange Factor ARHGEF17 Is a Risk Gene for Intracranial Aneurysms. Circ Genom Precis Med 11:e002099
Rasmy, Laila; Wu, Yonghui; Wang, Ningtao et al. (2018) A study of generalizability of recurrent neural network-based predictive models for heart failure onset risk using a large and heterogeneous EHR data set. J Biomed Inform 84:11-16
Chen, Yiyun; Bartanus, Justin; Liang, Desheng et al. (2017) Characterization of chromosomal abnormalities in pregnancy losses reveals critical genes and loci for human early development. Hum Mutat 38:669-677
Dai, Hongying; Wu, Guodong; Wu, Michael et al. (2016) An Optimal Bahadur-Efficient Method in Detection of Sparse Signals with Applications to Pathway Analysis in Sequencing Association Studies. PLoS One 11:e0152667
Xue, Cheng; Chen, Hua; Yu, Fuli (2016) Base-Biased Evolution of Disease-Associated Mutations in the Human Genome. Hum Mutat 37:1209-1214
Huang, Zhuoyi; Rustagi, Navin; Veeraraghavan, Narayanan et al. (2016) A hybrid computational strategy to address WGS variant analysis in >5000 samples. BMC Bioinformatics 17:361
Zhi, Degui; Liu, Nianjun; Zhang, Kui (2015) On the design and analysis of next-generation sequencing genotyping for a cohort with haplotype-informative reads. Methods 79-80:41-6
Geng, Xin; Sha, Jin; Liu, Shikai et al. (2015) A genome-wide association study in catfish reveals the presence of functional hubs of related genes within QTLs for columnaris disease resistance. BMC Genomics 16:196
Challis, Danny; Antunes, Lilian; Garrison, Erik et al. (2015) The distribution and mutagenesis of short coding INDELs from 1,128 whole exomes. BMC Genomics 16:143