Messy data are ubiquitous in modern science. Data come from heterogeneous sources, there are many latent confounding factors, and it is often unclear which questions to ask and which models to use. This reality is in sharp contrast with the usual modeling assumptions of machine learning and statistics, where data are assumed to come from well-specified models and the hypotheses to test are clearly laid out. The glaring gap between standard theory and the messy data encountered in practice is a major contributor to the reproducibility crises across science and prevents researchers from harnessing the full insights from data. This project will develop rigorous mathematical foundations and robust machine learning algorithms to address the core challenges of messy data. The PI will explore novel techniques to quantify and reduce the different types of selection bias that arise from exploratory data analysis. The PI will also investigate algorithms to perform statistical inference when the model is mis- or under-specified. The project will apply these new methods to tackle challenging problems in human population genomics.

The PI recently initiated a framework based on information usage to quantify the magnitude of over-fitting and bias arising from data exploration. This project will significantly expand this framework. In particular, the PI will apply this information usage approach to quantify and reduce bias in data generated from adaptive experimentation, such as online A/B testing and more general multi-armed bandits. Related to over-fitting is the problem of mis- and under-specified statistical models. The PI has recently developed method-of-cumulant approaches to learn probabilistic models when the observations are perturbed by unknown and arbitrary interference. A promising direction of research is to extend this approach to more general settings that allow for nonlinear interference and to develop software tools for the broad data science community. Genomics exemplifies many of the challenges of messy data: genomic data typically require substantial exploratory analysis and face modeling uncertainty. This makes genomics a high-impact domain in which to apply the new messy data algorithms developed here. Biomedical databases are interactively analyzed by many researchers and thus are particularly prone to exploration bias and over-fitting. The PI will explore piloting the information usage framework on the biomedical data hubs being created at Stanford in order to quantify and reduce exploration bias. As a part of the project, the PI is also developing courses, workshops and tutorials to bring together researchers and practitioners across machine learning, statistics, information theory and biomedical data science to address the ubiquitous challenge of messy data.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1657155
Program Officer: Sylvia Spengler
Budget Start: 2017-05-01
Budget End: 2019-08-31
Fiscal Year: 2016
Total Cost: $174,999
Name: Stanford University
City: Stanford
State: CA
Country: United States
Zip Code: 94305