Emerging sequencing technologies have made whole-genome sequencing available to researchers studying various phenotypes and diseases of interest, with a particular focus on rare variants. Although the first batch of sequencing projects focused mainly on unrelated individuals, numerous sequencing studies that include related individuals have been carried out or launched recently as sequencing costs fall rapidly. However, methodologies for analyzing family-based sequence data have lagged behind, partly because of the complexity of family structures and computational barriers. In this study, our primary goals are to efficiently and accurately infer individual genotypes and haplotypes (the key component of any sequencing project) by combining information at both the family and population levels, and to study how differential sequencing errors affect downstream association analysis. To achieve these goals, we propose the following specific aims: 1) We will propose a novel statistical framework for genotype calling and haplotype inference from sequence data that include related individuals. 
The new method takes advantage of both the short stretches shared between unrelated individuals and the long stretches shared between family members in a computationally feasible manner, while retaining a high degree of accuracy, through the synergy between two classic approaches: the hidden Markov model (HMM) for linkage disequilibrium information and the Lander-Green algorithm for inheritance vectors; 2) We will develop an exact algorithm for HMM computation to speed up a class of widely used genetics programs, including the method developed in Aim 1, without any sacrifice of accuracy; 3) We will assess the impact of sequencing errors on family-based association methods for rare variants and use the intrinsic stochastic nature of the methods proposed in Aim 1 to reduce false positives under a multiple-imputation framework; 4) We will test and recalibrate our methods in collaboration with ongoing sequencing projects and systematically investigate different study designs. Successful completion of these aims will yield state-of-the-art statistical methods and software that will facilitate the fast-growing number of sequencing projects that include family members and guide the design and analysis of future studies.
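
As a point of reference for the HMM computation mentioned in Aims 1 and 2, the following is a minimal sketch of the standard scaled forward algorithm, which computes the log-likelihood of an observed sequence under a generic discrete HMM. It is not the proposed method and does not implement the Lander-Green algorithm; the function name `forward_log_likelihood` and the two-state toy parameters are illustrative assumptions only.

```python
import math


def forward_log_likelihood(init, trans, emit, obs):
    """Log-likelihood of `obs` under a discrete HMM (scaled forward pass).

    init[s]     : probability of starting in hidden state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of observing symbol o in state s
    obs         : sequence of observed symbol indices
    All names and shapes here are hypothetical, chosen for illustration.
    """
    n_states = len(init)
    # Forward variables for the first observation.
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states))
                * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            raise ValueError("observation sequence has zero probability")
        # Rescale to avoid numerical underflow on long sequences and
        # accumulate the log of the scaling factors.
        alpha = [a / scale for a in alpha]
        log_lik += math.log(scale)
    return log_lik


# Toy two-state model (assumed values, not taken from the proposal).
toy_init = [0.5, 0.5]
toy_trans = [[0.9, 0.1], [0.1, 0.9]]
toy_emit = [[0.8, 0.2], [0.2, 0.8]]
toy_log_lik = forward_log_likelihood(toy_init, toy_trans, toy_emit, [0, 0, 1])
```

The cost of this pass grows with the square of the number of hidden states at each position, which is the kind of computation that Aim 2 seeks to accelerate without approximation.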

Public Health Relevance

Next-generation sequencing studies have been widely conducted to identify rare variants associated with complex diseases. We will develop several statistical and computational methods, including methods for genotype calling and association analysis, to facilitate the analysis of both population-based and family-based sequence data in ongoing and future sequencing projects.

National Institutes of Health (NIH)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Brooks, Lisa
University of Pittsburgh
Schools of Medicine
United States