In order for the scientific community to reach consensus on a scientific finding, the finding must be replicated in multiple studies by different groups. Unfortunately, not all scientific studies successfully replicate. When a scientific finding is reported, the confidence in that finding is quantified by a p-value. In principle, p-values should quantify how often a study should replicate. Over the past decade, researchers have shown that scientific studies replicate at a much lower rate than the reported p-values predict. This has led to a vigorous discussion of the causes of replication failures as well as to the development of guidelines for study design to improve the replication rate. In this project, the research team will show that when studies collect large amounts of data, these data can be used to identify differences between the studies and to gain insight into why studies do or do not replicate. This information can be used to improve the individual studies and increase the replication rate of the resulting findings. As replication is a fundamental tool in scientific discovery, developing new approaches to analyzing replication studies will have an impact in many areas of science. The team has a long-standing interest in involving undergraduate students in its research as well as in broadening the diversity of participants.
In this project, the replicability of high dimensional studies is considered. In a high dimensional study, rather than a single p-value, each study typically reports thousands or even millions of p-values. Genomic studies are a motivating example: genomic data is inherently high dimensional, and a p-value is computed for each genomic feature, such as a gene expression level or a genetic variant. Typically, in a genomic study, only a small subset of all of the p-values are considered significant (after accounting for multiple testing). When a replication study is performed, the features of interest are those that were significant in the original study. The key idea behind this project is that there is information in all reported features, even those that are not significant. By analyzing them, insights can be obtained about the studies, and these insights can improve both the replication rate and the analysis of each of the studies. The framework can be leveraged to address the following problems: (1) reducing the effect of confounders in each replicate, improving power and reducing false positives; (2) accounting for ascertainment biases in the reported results; and (3) interpreting the differences between each replicate or study to gain insight into the underlying causes of the difference. The approach will be evaluated using five genomic datasets.
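As a rough illustration of the multiple-testing and follow-up step described above, the following sketch (written in Python with NumPy; the feature counts, the uniform p-values, and all variable names are purely hypothetical and not drawn from the project) applies the standard Benjamini-Hochberg procedure to a vector of per-feature p-values from an "original" study, selects the significant features, and checks which of them also pass a nominal threshold in a "replication" study. It is a minimal sketch of the usual workflow, not an implementation of the project's proposed framework, which additionally uses the non-significant features.

    import numpy as np

    def benjamini_hochberg(pvals, alpha=0.05):
        """Return a boolean mask of features declared significant at FDR level alpha."""
        pvals = np.asarray(pvals)
        m = len(pvals)
        order = np.argsort(pvals)
        ranked = pvals[order]
        # Largest rank k (1-indexed) with p_(k) <= (k/m) * alpha.
        thresholds = (np.arange(1, m + 1) / m) * alpha
        below = ranked <= thresholds
        significant = np.zeros(m, dtype=bool)
        if below.any():
            k = np.max(np.where(below)[0])       # last sorted index meeting the threshold
            significant[order[: k + 1]] = True   # reject all hypotheses up to that rank
        return significant

    # Hypothetical example: per-feature p-values from an original and a replication study.
    rng = np.random.default_rng(0)
    m = 10_000                                   # number of genomic features (e.g., genes or variants)
    p_original = rng.uniform(size=m)
    p_replication = rng.uniform(size=m)

    sig_original = benjamini_hochberg(p_original, alpha=0.05)
    # Only features significant in the original study are typically followed up,
    # even though the full replication p-value vector carries information about study differences.
    replicated = sig_original & (p_replication < 0.05)
    print(sig_original.sum(), replicated.sum())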
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.