Data-driven science relies not only on statistics and machine learning, but also on human expertise. As data are being collected to tackle increasingly challenging scientific and medical problems, there is need to scale up the amount of expert human input (curation and, in certain cases, annotation) accordingly. This project addresses this need by developing collaborative data curation: instead of relying on a small number of experts, it enables annotations to be made by communities of users of varying expertise. Since the quality of annotations by different users will vary, novel quantitative techniques are developed to assess the trustworthiness of each user, based on their actions, and to distinguish trustworthy experts from unskilled and malicious users. Algorithms are developed to combine users' annotations based on their trustworthiness. Collaborative data curation will greatly increase the amount of human annotated data, which will, in turn, lead to better Big Data analysis and detection algorithms for the life sciences, medicine, and beyond.

The central problems of collaborative data curation lie in the high variability in the quality of users' annotations, and variability in the form the data takes when they annotate it. The proposal develops techniques to take annotations made by different users over different views of data (such as an EEG display with filters and transformations applied to the signal), to use provenance to reason about how the annotations relate to the original data, and to reason about the reliability and trustworthiness of each user's annotations over this data. To accomplish this, the research first defines data and provenance models that capture time- and space-varying data; novel reliability calculus algorithms for computing and dynamically updating the reliability and trustworthiness of individuals, based on their annotations and how these compare to annotations from recognized experts and the broader community; and a high-level language called PAL that enables the researchers to implement and compare multiple policies. The researchers will initially develop and validate the techniques on neuroscience and time series data, within a 900+ user public data sharing portal (with 1500+ EEG and other datasets for which annotations are required). The project team later expands the techniques to other data modalities, such as imaging and genomics

Agency
National Science Foundation (NSF)
Institute
Division of Advanced CyberInfrastructure (ACI)
Type
Standard Grant (Standard)
Application #
1547360
Program Officer
Robert Beverly
Project Start
Project End
Budget Start
2015-09-01
Budget End
2019-08-31
Support Year
Fiscal Year
2015
Total Cost
$500,000
Indirect Cost
Name
University of Pennsylvania
Department
Type
DUNS #
City
Philadelphia
State
PA
Country
United States
Zip Code
19104