Big Data technology promises to improve people's lives, accelerate scientific discovery and innovation, and bring about positive societal change. Yet, if not used responsibly, this same technology can reinforce inequity, limit accountability, and infringe on the privacy of individuals: irreproducible results can influence global economic policy; algorithmic changes in search engines can sway elections and incite violence; models based on biased data can legitimize and amplify discrimination in the criminal justice system; algorithmic hiring practices can silently reinforce a lack of diversity and potentially violate the law; privacy and security violations can erode the trust of users and expose companies to legal and financial consequences. The focus of this project is on using Big Data technology responsibly, in accordance with ethical and moral norms and with legal and policy considerations. This project establishes a foundational new role for data management technology, in which managing the responsible use of data across the lifecycle becomes a core system requirement. The broader goal of this project is to help usher in a new phase of data science, in which the technology not only considers the accuracy of the model but also ensures that the data on which the model depends respect the relevant laws and societal norms, and that impacts on humans are taken into account.

This project defines properties of responsible data management, which include fairness (and the related concepts of representativeness and diversity), transparency (and accountability), and data protection. It complements work in the data mining and machine learning communities, which focuses on the fairness, accountability and transparency of the final step in the data analysis lifecycle, by considering the problems that can be introduced upstream of data analysis: during dataset selection, cleaning, pre-processing, integration, and sharing. This project develops conceptual frameworks and algorithmic techniques that support fairness, transparency and data protection properties through all stages of the data usage lifecycle: beginning with data discovery and acquisition, through cleaning, integration, and querying, and ultimately analysis. The contributions are structured along three aims. Aim 1 considers responsible dataset discovery, profiling, and integration. Aim 2 considers responsible query processing and develops a general framework for declarative specification, checking and enforcement of fairness, representativeness and diversity. Aim 3 incorporates data protection into the lifecycle, develops techniques to facilitate sharing of sensitive data, and considers the tradeoffs between privacy and transparency. This project is poised to establish a multidisciplinary research agenda around responsible data management as a critical factor in enabling fairness, accountability and transparency in decision-making and prediction systems. Additional information about the project is available at DataResponsibly.com.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1741254
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2017-09-01
Budget End: 2021-08-31
Support Year:
Fiscal Year: 2017
Total Cost: $365,000
Indirect Cost:
Name: University of Massachusetts Amherst
Department:
Type:
DUNS #:
City: Hadley
State: MA
Country: United States
Zip Code: 01035