The imminent influx of high-throughput sequencing datasets poses both a unique opportunity to identify the disease susceptibility loci (DSLs) for complex diseases and their pathways, and a challenge for the statistical analysis. Many of the variants recorded by high-throughput sequencing studies will be rare and will therefore provide insufficient power for the statistical analysis. For studies with unrelated cases and controls, a number of collapsing approaches have been suggested. However, such methodology does not exist for family-based studies, which are by design well suited for rare-variant analysis: they have higher statistical power for rare variants and are robust against population admixture. For population-based designs, statistical approaches that adjust the analysis for such confounding do not exist if the variants are rare. However, to construct collapsing methods for family-based designs, the linkage disequilibrium (LD) between the loci has to be estimated, which is a non-trivial task for rare variants. In population-based designs, this issue can be avoided by utilizing permutation tests that randomly reassign the phenotype but keep the genetic data of each subject fixed. This is not possible in family-based designs. In this grant application, we will develop an analytical approach to the LD-estimation problem in family-based designs. This will enable the construction of rare-variant tests for family-based designs. The major goal of sequence analysis is the identification of the DSLs. The significance of single-locus association tests is determined by the genetic effect size and the allele frequency. Since non-DSLs that are in LD with the true DSL can have higher allele frequencies than the DSL but smaller observed genetic effect sizes, the significance of the test cannot be used to identify DSLs.
To distinguish the true DSLs from SNPs that are in LD with the DSLs, we will develop statistical approaches that assess differences in LD patterns across multiple loci between subjects. Such methodology will be proposed for designs of unrelated individuals and for family-based studies. The new analysis approaches will be integrated into our software packages. They will support the search for disease loci in the human genome, which will lead to a better understanding of the pathways for complex diseases and ultimately to their treatment.
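
The permutation strategy mentioned above for population-based designs can be illustrated with a minimal sketch. It is not the methodology proposed in this application, and it does not apply to family-based designs. It assumes a simple burden-style collapsing score (the count of minor alleles a subject carries in a region) and a difference in mean burden between cases and controls as the test statistic. The function names and the toy data are hypothetical and chosen only for illustration.

```python
import random


def burden_scores(genotypes):
    """Collapse each subject's rare-variant genotypes into a single score:
    the total number of minor alleles carried across the region."""
    return [sum(row) for row in genotypes]


def mean_difference(scores, phenotypes):
    """Test statistic: mean burden in cases (1) minus mean burden in controls (0)."""
    cases = [s for s, y in zip(scores, phenotypes) if y == 1]
    controls = [s for s, y in zip(scores, phenotypes) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def permutation_p_value(genotypes, phenotypes, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the collapsed burden statistic.

    Only the phenotype labels are shuffled; each subject's genotype row is
    left untouched, so the LD structure among the variants is preserved
    without ever having to be estimated.
    """
    rng = random.Random(seed)
    scores = burden_scores(genotypes)
    observed = abs(mean_difference(scores, phenotypes))
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(mean_difference(scores, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: rows are subjects, columns are rare variants,
# entries are minor-allele counts (0, 1, or 2).
toy_genotypes = [
    [1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 0],  # cases
    [0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0],  # controls
]
toy_phenotypes = [1, 1, 1, 1, 0, 0, 0, 0]

p_value = permutation_p_value(toy_genotypes, toy_phenotypes)
```

Because shuffling breaks only the link between phenotype and genotype, the joint distribution of the variants within a subject is carried through every permutation unchanged. In families, phenotypes cannot be exchanged freely between related subjects in this way, which is why an analytical approach to LD estimation is needed there.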

Public Health Relevance

Sequencing data contain the information needed to identify the causal genetic loci for complex diseases and phenotypes. However, translating this wealth of information into the discovery of disease loci requires novel statistical analysis approaches. While the current analysis methodology remains valid, it is not optimally designed for rare variants and sequence data. We will develop statistical tools that are robust against confounding in rare-variant data and that can identify the locations of the disease loci in sequencing data. This information will support the search for disease pathways and for cures.

National Institutes of Health (NIH)
National Institute of Mental Health (NIMH)
Research Project (R01)
Study Section
Behavioral Genetics and Epidemiology Study Section (BGES)
Program Officer
Addington, Anjene M
Harvard University
Biostatistics & Other Math Sci
Schools of Public Health
United States
Hecker, Julian; Xu, Xin; Townes, F William et al. (2018) Family-based tests for associating haplotypes with general phenotype data: Improving the FBAT-haplotype algorithm. Genet Epidemiol 42:123-126
Loehlein Fier, Heide; Prokopenko, Dmitry; Hecker, Julian et al. (2017) On the association analysis of genome-sequencing data: A spatial clustering approach for partitioning the entire genome into nonoverlapping windows. Genet Epidemiol 41:332-340
Hecker, Julian; Maaser, Anna; Prokopenko, Dmitry et al. (2017) Reporting Correct p Values in VEGAS Analyses. Twin Res Hum Genet 20:257-259
Schlauch, Daniel; Fier, Heide; Lange, Christoph (2017) Identification of genetic outliers due to sub-structure and cryptic relationships. Bioinformatics 33:1972-1979
Prokopenko, Dmitry; Hecker, Julian; Silverman, Edwin K et al. (2016) Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics 32:1366-72
Prokopenko, Dmitry; Hecker, Julian; Silverman, Edwin et al. (2015) Using Network Methodology to Infer Population Substructure. PLoS One 10:e0130708
Hecker, Julian; Prokopenko, Dmitry; Lange, Christoph et al. (2015) On the Recombination Rate Estimation in the Presence of Population Substructure. PLoS One 10:e0145152
Erk, Susanne; Meyer-Lindenberg, Andreas; Linden, David E J et al. (2014) Replication of brain function effects of a genome-wide supported psychiatric risk variant in the CACNA1C gene and new multi-locus effects. Neuroimage 94:147-154
Qiao, Dandi; Cho, Michael H; Fier, Heide et al. (2014) On the simultaneous association analysis of large genomic regions: a massive multi-locus association test. Bioinformatics 30:157-64
Naylor, Melissa G; Cardenas, Valerie A; Tosun, Duygu et al. (2014) Voxelwise multivariate analysis of multimodality magnetic resonance imaging. Hum Brain Mapp 35:831-46

Showing the most recent 10 out of 37 publications