This project seeks to develop methods to quantify uncertainty in machine learning algorithms and to integrate machine learning with statistical inference. Machine learning has been enormously successful at using data to make predictions; it is used in an extensive range of applications, from handwriting recognition to high-frequency trading to driverless cars and personalized medicine. However, while machine learning algorithms make good predictions, they tell humans very little about how those predictions were arrived at: What were the important factors? How did they affect the prediction? They also do not distinguish predictions for which there is a lot of information about the probability of different outcomes (even if those outcomes cover a wide range) from predictions for which very little information is available. For example, a machine learning algorithm may very accurately predict whether a person is likely to develop diabetes, yet provide little if any information about how that person might lower his or her risk. This project will build on initial mathematical theory to develop methods that explain how Random Forests arrive at their predictions and how statistically confident those predictions are, and to produce ways to link machine learning methods to other statistical models.

The project will extend a theoretical framework representing Random Forests as U-statistics to produce a practical implementation of statistical uncertainty quantification in machine learning. In particular, it will improve on methods to estimate sampling variability in Random Forest predictions, develop computationally efficient screening tools for covariate and interaction selection, and incorporate ensemble methods as non-parametric terms in partially linear models while retaining statistical inference via a modified boosting algorithm. These methods will be demonstrated on a citizen science database in ornithology and in various biomedical applications.
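To make the U-statistic view concrete, the following is a minimal sketch, not the project's implementation. It assumes an ensemble of base learners, each fit on a subsample of size k drawn without replacement from n observations, so that the averaged prediction behaves like an incomplete U-statistic with variance approximately (k^2 / n) * zeta_1 + zeta_k / m for m learners. The depth-1 regression stump, the function names `fit_stump_predict` and `subsampled_ensemble_variance`, and the parameters `n_z` and `n_mc` are all illustrative assumptions; a real Random Forest would use full trees and many covariates.

```python
import random
import statistics


def fit_stump_predict(sample, x0):
    """Fit a depth-1 regression tree (stump) on (x, y) pairs and predict at x0."""
    pts = sorted(sample)
    best = None  # (sse, threshold, left_mean, right_mean)
    for i in range(1, len(pts)):
        if pts[i - 1][0] == pts[i][0]:
            continue  # no valid split between equal x values
        left = [y for _, y in pts[:i]]
        right = [y for _, y in pts[i:]]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pts[i - 1][0] + pts[i][0]) / 2, lm, rm)
    if best is None:  # all x equal: fall back to the overall mean
        return statistics.fmean([y for _, y in pts])
    _, threshold, lm, rm = best
    return lm if x0 <= threshold else rm


def subsampled_ensemble_variance(data, x0, k, n_z, n_mc, rng):
    """Return (prediction, variance estimate) at x0 for a subsampled ensemble.

    Assumed scheme: pick n_z anchor observations; for each anchor, fit n_mc
    stumps on subsamples of size k that all contain the anchor.
    zeta_1 is estimated by the variance of the per-anchor mean predictions,
    zeta_k by the variance of all individual predictions.
    """
    n = len(data)
    group_means, all_preds = [], []
    for _ in range(n_z):
        anchor = rng.randrange(n)
        others = [i for i in range(n) if i != anchor]
        preds = []
        for _ in range(n_mc):
            idx = [anchor] + rng.sample(others, k - 1)
            preds.append(fit_stump_predict([data[i] for i in idx], x0))
        group_means.append(statistics.fmean(preds))
        all_preds.extend(preds)
    zeta_1 = statistics.variance(group_means)
    zeta_k = statistics.variance(all_preds)
    m = n_z * n_mc  # total number of base learners in the ensemble
    prediction = statistics.fmean(all_preds)
    variance = (k ** 2 / n) * zeta_1 + zeta_k / m
    return prediction, variance


# Synthetic data, for illustration only.
rng = random.Random(0)
data = [(x, 2.0 * x + rng.gauss(0.0, 0.5))
        for x in (rng.uniform(0.0, 1.0) for _ in range(200))]
pred, var = subsampled_ensemble_variance(data, x0=0.5, k=20, n_z=25, n_mc=40, rng=rng)
half_width = 1.96 * var ** 0.5  # normal-approximation interval half-width
print(f"prediction {pred:.3f} +/- {half_width:.3f}")
```

The plug-in estimate of zeta_1 used here still carries Monte Carlo noise from the finite number of learners per anchor, so it tends to be noisy and inflated. That is one reason better estimators of sampling variability are worth developing.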

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1712554
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2017-09-01
Budget End: 2021-08-31
Support Year:
Fiscal Year: 2017
Total Cost: $215,078
Indirect Cost:
Name: Cornell University
Department:
Type:
DUNS #:
City: Ithaca
State: NY
Country: United States
Zip Code: 14850