As the amount and variety of data available online explodes, it is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, this project will develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) for analytical queries. This will help to better understand answers in the presence of incomplete information across fields ranging from business and the military to medical applications.

This project will develop and exploit the following paradoxical statistical phenomenon: the ability to see certain data items more than once (across multiple data sets) enables one to estimate parameters of data items that have never been seen at all. This project will therefore develop new statistical techniques which take advantage of overlapping datasets, and software backed by both theory and experiments. This will enable users with overlapping incomplete data sets to actively "see the unseen," and in many cases perform as though they had access to missing information not represented in any of their data sources. The project will also focus on data validation, and how to use multiple unreliable data sources to correct each other. Further, as the proposed analysis is nuanced and novel, the project will also explore how to best convey valuable insights to the user, via interactive visualizations of the predictions. For further information see the project web site at: http://unknown-unknowns.cs.brown.edu

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
2033792
Program Officer
Sylvia Spengler
Project Start
Project End
Budget Start
2020-04-01
Budget End
2021-05-31
Support Year
Fiscal Year
2020
Total Cost
$334,537
Indirect Cost
Name
Massachusetts Institute of Technology
Department
Type
DUNS #
City
Cambridge
State
MA
Country
United States
Zip Code
02139