Emerging sequencing technologies have made whole-genome sequencing available to researchers studying various phenotypes and diseases of interest, with a particular focus on rare variant sites. Although the first batch of sequencing projects focused mainly on the analysis of unrelated individuals, numerous sequencing studies that include related individuals have recently been carried out or launched as sequencing costs fall rapidly. However, methodologies for analyzing family-based sequence data lag behind, partly because of the complexity of family structures and the computational barriers involved. In this study, our primary goals are to efficiently and accurately infer individual genotypes and haplotypes, the key component of any sequencing project, by combining information at both the family and population levels, and to study how differential sequencing errors affect downstream association analysis. To achieve these goals, we propose the following specific aims: 1) We will propose a novel statistical framework for genotype calling and haplotype inference in sequence data that include related individuals. The new method takes advantage of both the short stretches shared between unrelated individuals and the long stretches shared between family members in a computationally feasible manner, while retaining a high degree of accuracy, through the synergy between two classic approaches: the hidden Markov model (HMM) for linkage disequilibrium information and the Lander-Green algorithm for inheritance vectors; 2) We will develop an exact algorithm for HMM computation to speed up a class of widely used genetics programs, including the method developed in Aim 1, without any sacrifice of accuracy; 3) We will assess the impact of sequencing errors on family-based association methods for rare variants and use the intrinsic stochastic nature of the methods proposed in Aim 1 to reduce false positives under a multiple-imputation framework; 4) We will test and recalibrate the methods we develop in collaboration with ongoing sequencing projects and systematically investigate different study designs. Successful completion of these aims will yield state-of-the-art statistical methods and software, which will facilitate the fast-growing body of sequencing projects that include family members and guide the design and analysis of future studies.

Public Health Relevance

Next-generation sequencing studies have been widely conducted to identify rare variants associated with complex diseases. We will develop several statistical and computational methods, including methods for genotype calling and association analysis, to facilitate the analysis of both population-based and family-based sequence data in ongoing and future sequencing projects.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG007358-03
Application #
9002848
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Brooks, Lisa
Project Start
2014-04-25
Project End
2019-01-31
Budget Start
2016-02-01
Budget End
2017-01-31
Support Year
3
Fiscal Year
2016
Total Cost
Indirect Cost
Name
University of Pittsburgh
Department
Pediatrics
Type
Schools of Medicine
DUNS #
004514360
City
Pittsburgh
State
PA
Country
United States
Zip Code
15213
Sun, Zhe; Wang, Ting; Deng, Ke et al. (2018) DIMM-SC: a Dirichlet mixture model for clustering droplet-based single cell transcriptomic data. Bioinformatics 34:139-146
Yan, Qi; Fang, Zhou; Chen, Wei (2018) KMgene: a unified R package for gene-based association analysis for complex traits. Bioinformatics 34:2144-2146
Chiu, Chi-Yang; Jung, Jeesun; Chen, Wei et al. (2017) Meta-analysis of quantitative pleiotropic traits for next-generation sequencing with multivariate functional linear models. Eur J Hum Genet 25:350-359
Hui, Daniel; Fang, Zhou; Lin, Jerome et al. (2017) LAIT: a local ancestry inference toolkit. BMC Genet 18:83
Forno, Erick; Wang, Ting; Yan, Qi et al. (2017) A Multiomics Approach to Identify Genes Associated with Childhood Asthma Risk and Morbidity. Am J Respir Cell Mol Biol 57:439-447
Chen, Han; Wang, Chaolong; Conomos, Matthew P et al. (2016) Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies via Logistic Mixed Models. Am J Hum Genet 98:653-666
Wang, Ting; Ren, Zhao; Ding, Ying et al. (2016) FastGGM: An Efficient Algorithm for the Inference of Gaussian Graphical Model in Biological Networks. PLoS Comput Biol 12:e1004755
Fan, Ruzong; Wang, Yifan; Yan, Qi et al. (2016) Gene-Based Association Analysis for Censored Traits Via Fixed Effect Functional Regressions. Genet Epidemiol 40:133-143
Zeng, Zhen; Weeks, Daniel E; Chen, Wei et al. (2016) A Pipeline for Classifying Relationships Using Dense SNP/SNV Data and Putative Pedigree Information. Genet Epidemiol 40:161-171
Yan, Qi; Chen, Rui; Sutcliffe, James S et al. (2016) The impact of genotype calling errors on family-based studies. Sci Rep 6:28323

Showing the most recent 10 out of 21 publications