Identifying the genetic factors that contribute to complex diseases is one of the grand challenges of the post-genomic era. A series of exciting new findings has recently been made using the genome-wide association study (GWAS) design. However, moving from a confirmed association signal to the set of causal variants at a given locus poses significant challenges. A desirable follow-up strategy to GWAS is to conduct a comprehensive resequencing analysis of the genomic regions of interest. This will allow scientists to discover and study all sequence variants in those regions, which greatly increases the chance of identifying new disease-causing mutations. Rapid advances in next-generation sequencing technologies are making such a strategy increasingly feasible. Novel statistical methods need to be developed to analyze the data generated by these new sequencing instruments. In this proposal, we focus on identifying single nucleotide polymorphisms (SNPs) from resequencing data generated on the Illumina Genome Analyzer platform. First, we will develop a probability-based model that allows us to simultaneously map multi-mapped short sequencing reads, identify sequencing errors, and call SNPs and their genotypes. Because our method will be developed under the Bayesian framework, additional information such as genotypes obtained from GWAS can be incorporated as informative priors to improve our inference. Second, we will develop a probability-based approach that combines sequencing read data at selected loci from multiple individuals to improve SNP and genotype calling. The goal is to borrow strength across a pool of samples to resolve ambiguity at loci with low sequencing depth. We will implement our statistical methods in freely available software tools to facilitate the analysis of targeted resequencing studies. Finally, we plan to apply our methods to data generated from real targeted resequencing studies that are being planned for psoriasis and type 2 diabetes through collaboration.
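To make the role of an informative prior concrete, the following is a minimal illustrative sketch of Bayesian genotype calling at a single biallelic site. It is not the model proposed here: it ignores read mapping uncertainty and per-base quality scores, and it assumes a simple binomial error model. The function name `genotype_posterior`, the per-read error rate of 0.01, and the default prior values are all hypothetical choices made for illustration.

```python
import math

# A = reference allele, B = alternate allele (biallelic site assumed).
GENOTYPES = ("AA", "AB", "BB")


def genotype_posterior(n_ref, n_alt, error_rate=0.01,
                       prior=(0.98, 0.015, 0.005)):
    """Return the posterior probability of each genotype at one site.

    n_ref, n_alt : counts of reads showing the reference / alternate allele
    error_rate   : assumed per-read sequencing error rate (illustrative)
    prior        : prior probabilities for AA, AB, BB; must be strictly
                   positive (illustrative values, e.g. replaced by a
                   GWAS-informed prior when a genotype call is available)
    """
    # Probability that a single read shows the alternate allele,
    # conditional on the true genotype.
    p_alt = {"AA": error_rate, "AB": 0.5, "BB": 1.0 - error_rate}

    # Work in log space; the binomial coefficient is the same for every
    # genotype and cancels in the normalisation, so it is omitted.
    log_post = []
    for genotype, pr in zip(GENOTYPES, prior):
        log_lik = (n_alt * math.log(p_alt[genotype])
                   + n_ref * math.log(1.0 - p_alt[genotype]))
        log_post.append(math.log(pr) + log_lik)

    # Normalise with the max subtracted for numerical stability.
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    total = sum(weights)
    return {g: w / total for g, w in zip(GENOTYPES, weights)}


def call_genotype(posterior):
    """Return the genotype with the highest posterior probability."""
    return max(posterior, key=posterior.get)


if __name__ == "__main__":
    # Low depth: one reference read and one alternate read.
    default = genotype_posterior(1, 1)
    # Hypothetical prior reflecting a heterozygous GWAS genotype call.
    informed = genotype_posterior(1, 1, prior=(0.01, 0.98, 0.01))
    print(call_genotype(default), call_genotype(informed))
```

Under these assumptions, two reads are not enough to overcome a prior that strongly favors the homozygous reference genotype, whereas a prior centered on the heterozygous genotype changes the call. This is the sense in which external genotype information can help at loci with low sequencing depth.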

Public Health Relevance

Next-generation sequencing technologies facilitate large-scale resequencing studies, which offer a better chance of identifying disease-causing mutations. In this proposal, we will develop novel statistical methods for identifying genetic variants from so-called ultra-high-throughput sequencing data. When completed, the software tools and methods will be made freely available to support better analysis of data generated from resequencing studies.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Exploratory/Developmental Grants (R21)
Study Section: Genetic Variation and Evolution Study Section (GVE)
Program Officer: Brooks, Lisa
Emory University
Biostatistics & Other Math Sci
Schools of Public Health
United States
Johnston, Henry Richard; Hu, Yi-Juan; Gao, Jingjing et al. (2017) Identifying tagging SNPs for African specific genetic variation from the African Diaspora Genome. Sci Rep 7:46398
Mathias, Rasika Ann; Taub, Margaret A; Gignoux, Christopher R et al. (2016) A continuum of admixture in the Western Hemisphere revealed by the African Diaspora genome. Nat Commun 7:12522
Kessler, Michael D; Yerges-Armstrong, Laura; Taub, Margaret A et al. (2016) Challenges and disparities in the application of personalized genomic medicine to populations with African ancestry. Nat Commun 7:12521
Yuan, Shuai; Johnston, H Richard; Zhang, Guosheng et al. (2015) One Size Doesn't Fit All - RefEditor: Building Personalized Diploid Reference Genome to Improve Read Mapping and Genotype Calling in Next Generation Sequencing Studies. PLoS Comput Biol 11:e1004448
Yang, Rendong; Chen, Li; Newman, Scott et al. (2014) Integrated analysis of whole-genome paired-end and mate-pair sequencing data for identifying genomic structural variations in multiple myeloma. Cancer Inform 13:49-53
Yuan, Shuai; Qin, Zhaohui (2012) Read-mapping using personalized diploid reference genome for RNA sequencing data reduced bias for detecting allele-specific expression. IEEE Int Conf Bioinform Biomed Workshops 2012:718-724