As data becomes an essential driver of technological and economic development, it is critical to understand the value of data in different applications. This project develops a computational approach to quantify which types of data are more or less useful for training prediction algorithms. This characterization of data value is important because it enables users to filter out poor-quality data and to identify data that are important to collect in the future. In addition to data valuation, the project develops complementary methods to facilitate deleting data from prediction algorithms. These methods would allow users to quickly remove poor-quality data, or data that raise privacy concerns, from algorithms. Data valuation and data deletion are core aspects of recent policies aimed at giving individuals control over how their data is used and monetized by third parties. The methods developed in this project can inform the implementation of such policies.

This project develops a framework for data valuation that extends the concept of the Shapley value from economics. The Shapley value measures how much each individual component contributes to the whole group. This project will build a rigorous statistical theory of the data Shapley value, together with new scalable algorithms for estimating Shapley values on large datasets. It will also investigate modifications of the data Shapley value that relax its constraints. Computing the data Shapley value involves iteratively deleting certain data points and measuring the effect of each deletion on the performance of the trained machine learning model. This formulation closely links data valuation with the data deletion subproject, whose goal is to efficiently delete subsets of the training data from a machine learning model without retraining from scratch. The data valuation and deletion methods will be implemented and validated on large publicly available biomedical datasets.
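
To make the idea of a data Shapley value concrete, the following is a minimal illustrative sketch, not the project's actual method or code. It assumes a toy one-dimensional dataset, a 1-nearest-neighbour classifier, and validation accuracy as the performance measure; the names `utility`, `exact_data_shapley`, and `monte_carlo_data_shapley` are hypothetical. It computes the exact Shapley value of each training point by enumerating subsets, and a permutation-sampling estimate of the kind that remains feasible when exact enumeration does not.

```python
import itertools
import math
import random


def utility(train_subset, validation):
    """Hypothetical performance measure: validation accuracy of a
    1-nearest-neighbour classifier trained on (x, label) pairs."""
    if not train_subset:
        return 0.0  # assumption: an empty training set scores zero
    correct = 0
    for x, label in validation:
        # Ties in distance are broken by label, so the result depends only
        # on which points are in the subset, not on their order.
        nearest = min(train_subset, key=lambda p: (abs(p[0] - x), p[1]))
        correct += nearest[1] == label
    return correct / len(validation)


def exact_data_shapley(train, validation):
    """Exact Shapley values by enumerating every subset (small n only)."""
    n = len(train)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for subsets of size k: k! (n - k - 1)! / n!
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in itertools.combinations(others, k):
                without = [train[j] for j in subset]
                with_i = without + [train[i]]
                values[i] += weight * (utility(with_i, validation) - utility(without, validation))
    return values


def monte_carlo_data_shapley(train, validation, num_permutations=2000, seed=0):
    """Estimate Shapley values by averaging each point's marginal
    contribution over random orderings of the training set."""
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset = []
        previous = utility(subset, validation)
        for i in order:
            subset.append(train[i])
            current = utility(subset, validation)
            values[i] += current - previous
            previous = current
    return [v / num_permutations for v in values]


# Toy data (made up for illustration); the last training point carries a
# label that disagrees with its neighbours.
train = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1), (1.4, 1)]
validation = [(0.2, 0), (0.9, 0), (1.3, 0), (2.2, 1), (2.8, 1)]

exact_values = exact_data_shapley(train, validation)
mc_values = monte_carlo_data_shapley(train, validation)
for point, exact, estimate in zip(train, exact_values, mc_values):
    print(point, round(exact, 3), round(estimate, 3))
```

In this sketch the values of all training points sum to the accuracy of the model trained on the full set minus the accuracy assigned to the empty set. Every call to `utility` stands in for retraining the model on a subset, which is why estimating these values at scale is closely tied to being able to delete data from a trained model cheaply.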

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1942926
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2020-06-15
Budget End: 2025-05-31
Support Year:
Fiscal Year: 2019
Total Cost: $102,552
Indirect Cost:
Name: Stanford University
Department:
Type:
DUNS #:
City: Stanford
State: CA
Country: United States
Zip Code: 94305