In recent years, mapping the genetic determinants of disease has relied increasingly on statistical analysis. Before direct DNA sequencing became available, genotyping of individuals for association analysis was predominantly limited to loci known to be polymorphic in the general population. Genetic variants had to be discovered first and then assessed in samples of individuals with and without the condition. For single nucleotide polymorphisms (SNPs) with two alleles, the less frequent allele (the minor allele) had to be common enough in the population to be discovered. Thus, by design, genetic association studies relied on SNPs with both alleles being relatively frequent. One consequence of this design was that alleles occurring primarily in individuals with a condition would not be scored in a study. The hope was that common variants could be in linkage disequilibrium (that is, correlated) with untyped causal variants and serve as proxies for them, but the difference in frequencies between a common proxy variant and a rare causal one imposes strict bounds on how large the correlation can be and therefore leads to a loss of statistical power. Condition-specific rare variants could therefore have been missed completely by such analyses. With sequencing data now available, a new challenge is the need for statistical methods that aggregate signals across rare variants in a region. Even if certain rare variants are enriched in individuals with a condition, their frequencies are still expected to be low, and analyzing such variants one at a time is ineffective. To address these needs, we have been developing statistical tools within the functional linear models framework that allow simultaneous evaluation of all variants within a genetic region, both rare and common, as well as adjustment for environmental covariates.
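The bound on correlation between a common proxy and a rare causal variant can be made concrete with the standard constraint that allele frequencies place on the linkage disequilibrium coefficient D. The sketch below is illustrative only and is not part of the project's tools; the function name `max_r2` and the example frequencies are assumptions chosen for the illustration.

```python
def max_r2(p_a, p_b):
    """Largest attainable squared correlation (r^2) between two biallelic
    loci whose reference alleles have population frequencies p_a and p_b.

    Uses the standard bounds on the disequilibrium coefficient D:
      positive D is at most min(p_a * (1 - p_b), p_b * (1 - p_a))
      negative D is at most min(p_a * p_b, (1 - p_a) * (1 - p_b)) in size
    and r^2 = D^2 / (p_a * (1 - p_a) * p_b * (1 - p_b)).
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    d_pos = min(p_a * (1.0 - p_b), p_b * (1.0 - p_a))
    d_neg = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_max = max(d_pos, d_neg)
    return d_max ** 2 / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


if __name__ == "__main__":
    # Hypothetical frequencies: rare causal allele 0.01, common proxy 0.30.
    print(max_r2(0.01, 0.30))
    # Equal frequencies allow perfect correlation.
    print(max_r2(0.20, 0.20))
```

With these hypothetical frequencies the attainable r^2 is roughly 0.02, so even a perfectly placed common proxy would carry only a small fraction of the signal of the rare causal variant.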
Recent technological advances have drastically increased the amount of genetic data available to researchers. This has resulted in an unprecedented escalation in the number of statistical hypotheses routinely tested in a single study. Rather than following a carefully crafted set of scientific hypotheses with statistical analysis, researchers can now test many possible relations and let P-values or other statistical summaries generate hypotheses for them. Driven by these advances, testing a handful of genetic variants in relation to a health outcome has been largely abandoned in favor of agnostic screening of the entire genome followed by selection of the most significant results. The overwhelming majority of statistical testing is done within the traditional framework of significance testing, in which the evidence from every test is summarized by a P-value. The P-value is then compared to a significance threshold adjusted for the number of tests in the study. Partly because of their widespread use, P-values have been at the center of the replicability crisis. The uncertainty inherent in statistical inference limits the reliability of conclusions that can be drawn from data, but misuse of statistical methods and summaries is a growing concern. Significance and hypothesis testing, and the accompanying P-values, are being scrutinized as some of the most widely applied and abused practices. Rather than adopting the view that P-values should be abandoned because they are poorly suited to the way they are used in practice, we have been developing statistical methods for extracting information from them, so that a P-value, when augmented with external (prior) information about the effect size distribution, can be transformed into a complete posterior distribution for a standardized effect size.
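To illustrate the general idea of turning a P-value plus a prior into a posterior, the sketch below applies Bayes' rule on a grid of effect sizes. It is a minimal sketch under stated assumptions, not the project's method: it assumes the test statistic is normal with mean sqrt(n) times the standardized effect and unit variance, that the direction of the effect is known, and that the prior is supplied as a table of effect sizes and weights. The function name `posterior_from_p`, the grid, the weights, and the sample size are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def posterior_from_p(p_two_sided, sign, n, effects, prior):
    """Posterior weights over a tabulated prior for a standardized effect.

    p_two_sided : two-sided P-value of the test
    sign        : +1 or -1, the observed direction of the effect (assumed known)
    n           : sample size (assumed: Z ~ Normal(sqrt(n) * effect, 1))
    effects     : grid of standardized effect sizes
    prior       : prior weight of each grid value (same length as effects)
    """
    if not 0.0 < p_two_sided < 1.0:
        raise ValueError("P-value must lie strictly between 0 and 1")
    if len(effects) != len(prior):
        raise ValueError("effects and prior must have the same length")
    # Recover the signed Z statistic from the two-sided P-value.
    z = -sign * STD_NORMAL.inv_cdf(p_two_sided / 2.0)
    # Likelihood of the observed Z under each candidate effect size.
    likelihood = [STD_NORMAL.pdf(z - sqrt(n) * d) for d in effects]
    unnormalized = [w * l for w, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    # Hypothetical tabulated prior with most of its mass at zero effect.
    effects = [-0.10, -0.05, 0.0, 0.05, 0.10]
    prior = [0.01, 0.04, 0.90, 0.04, 0.01]
    post = posterior_from_p(1e-4, +1, 1000, effects, prior)
    for d, w in zip(effects, post):
        print(f"effect {d:+.2f}: posterior weight {w:.4f}")
```

Because the prior enters only as a table of weights, the same calculation works for a handcrafted or an empirically estimated distribution without fitting it to a parametric family.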
Our recent research has focused on developing methods for converting the summary information contained in test statistics and P-values from experiments with massive multiple testing into estimates of the credibility of hypotheses. We took special care to allow the effect size distribution to be specified flexibly, in ways not limited to convenience (conjugate) priors or to specific parametric distributions. For example, the methods should accommodate the expectation that the bulk of SNPs in genome-wide studies carry effect sizes close to zero, so it should be possible to handcraft an effect size distribution with the bulk of its density around zero. Moreover, methods have been emerging for estimating disease-specific effect size distributions from GWAS and replication studies. These methods allow effect size distributions to be estimated in tabulated form, where each range of effect sizes is accompanied by its estimated frequency in the genome. Such empirically estimated distributions are not necessarily expected to follow any standard or symmetric form, such as a normal distribution. Thus, more flexibility is needed, and the methods we have been developing allow researchers to incorporate empirically estimated distributions directly, rather than fitting them to a pre-specified standard distribution. As part of the evaluation of our methods, we have been developing theory that predicts the expected behavior of these posterior estimates. In particular, we derived quantifications of the relation between the number of tests in a study and the proportion of real signals expected to be contained in a set of the smallest P-values.
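The relation between the number of tests and the share of real signals among the smallest P-values can be explored with a small simulation. The sketch below is an illustration under simplified assumptions and does not reproduce the theory described above: every real signal is assumed to have a Z statistic drawn from a normal distribution with the same mean, a fixed proportion of tests are real signals, and tests are independent. The function names and all parameter values are hypothetical.

```python
import random


def top_k_true_fraction(m, k=10, prop_true=0.05, mean_z=2.5, rng=None):
    """Fraction of real signals among the k smallest two-sided P-values
    in one simulated study of m independent tests."""
    if rng is None:
        rng = random.Random()
    n_true = round(prop_true * m)
    # Each entry is (|Z|, is_real_signal); a larger |Z| means a smaller P-value.
    stats = [(abs(rng.gauss(mean_z, 1.0)), True) for _ in range(n_true)]
    stats += [(abs(rng.gauss(0.0, 1.0)), False) for _ in range(m - n_true)]
    stats.sort(key=lambda s: s[0], reverse=True)
    return sum(is_real for _, is_real in stats[:k]) / k


def mean_enrichment(m, reps=30, seed=1, **kwargs):
    """Average of top_k_true_fraction over reps simulated studies."""
    rng = random.Random(seed)
    total = sum(top_k_true_fraction(m, rng=rng, **kwargs) for _ in range(reps))
    return total / reps


if __name__ == "__main__":
    for m in (200, 2000, 20000):
        print(m, round(mean_enrichment(m), 2))
```

Under these assumptions the proportion of real signals is the same at every study size, so any change in the printed fractions reflects only the number of tests conducted.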

Support Year: 12
Fiscal Year: 2016
Name: U.S. National Inst of Environ Hlth Scis
Martin, Loren J; Smith, Shad B; Khoutorsky, Arkady et al. (2017) Epiregulin and EGFR interactions are involved in pain processing. J Clin Invest 127:3353-3366
Vsevolozhskaya, Olga; Ruiz, Gabriel; Zaykin, Dmitri (2017) Bayesian prediction intervals for assessing P-value variability in prospective replication studies. Transl Psychiatry 7:1271
Vsevolozhskaya, Olga A; Kuo, Chia-Ling; Ruiz, Gabriel et al. (2017) The more you test, the more you find: The smallest P-values become increasingly enriched with real findings as more tests are conducted. Genet Epidemiol 41:726-743
Dong, Jing; Wyss, Annah; Yang, Jingyun et al. (2017) Genome-Wide Association Analysis of the Sense of Smell in U.S. Older Adults: Identification of Novel Risk Loci in African-Americans and European-Americans. Mol Neurobiol 54:8021-8032
Shi, Min; O'Brien, Katie M; Sandler, Dale P et al. (2017) Previous GWAS hits in relation to young-onset breast cancer. Breast Cancer Res Treat 161:333-344
O'Brien, Katie M; Shi, Min; Sandler, Dale P et al. (2016) A family-based, genome-wide association study of young-onset breast cancer: inherited variants and maternally mediated effects. Eur J Hum Genet 24:1316-23
Vsevolozhskaya, Olga A; Zaykin, Dmitri V; Barondess, David A et al. (2016) Uncovering Local Trends in Genetic Effects of Multiple Phenotypes via Functional Linear Models. Genet Epidemiol 40:210-221
Vsevolozhskaya, Olga A; Greenwood, Mark C; Powell, Scott L et al. (2015) Resampling-based multiple comparison procedure with application to point-wise testing with functional data. Environ Ecol Stat 22:45-59
Meloto, Carolina B; Segall, Samantha K; Smith, Shad et al. (2015) COMT gene locus: new functional variants. Pain 156:2072-83
Weinberg, Clarice R; Zaykin, Dmitri (2015) Response. J Natl Cancer Inst 107:

Showing the most recent 10 out of 29 publications