The rapidly increasing number of interconnected devices and systems brings unprecedented opportunities for collaboration among researchers, business partners, and healthcare organizations, extending discoveries beyond those derivable from any single study. For instance, data sharing among medical organizations can improve understanding of the results of an individual clinical trial by pooling results from trials conducted at multiple organizations, which broadens the analysis of treatment options and accelerates biomedical research. While the vast amount of data collected from various sources brings benefits, integrating disparate data also makes it challenging to ensure the trustworthiness and quality of that data. Furthermore, faulty, improperly configured, or broken sensors, as well as buggy or compromised data processing units, can severely affect the quality of the data and of the analysis results. This project proposes to develop an infrastructure that provides robust, fine-grain, end-to-end provenance for collaborative data sharing and analytics. The outcome of this research will serve as a foundation for trustworthy data sharing and analytic infrastructures by creating a fine-grain, end-to-end data provenance framework that provides robust, attack- and fault-resilient provenance of shared data across diverse communication infrastructures.
In this project, the PIs will develop an end-to-end data provenance framework that provides robust, fine-grain data lineage for trustworthy data sharing and analytic infrastructures. First, they will develop a scalable and reliable infrastructure for collecting data provenance from distributed interconnected devices, one that can derive concise provenance data across heterogeneous environments (e.g., devices using diverse hardware and software platforms). Next, they will design and implement a framework that enables proper derivation and propagation of fine-grain provenance records for data sharing, processing, and analytics. The framework will provide services for provenance tracking as well as for processing provenance records when data are aggregated, analyzed, and processed (e.g., merge, split, duplicate, delete, extract, or statistical analytics). The PIs also plan to develop a system that can analyze and visualize the complete lineage of data in order to measure data quality and to conduct root cause analyses that identify the fundamental reasons behind data quality issues. The system will be capable of handling large amounts of data generated over long periods of time across multiple devices.
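As a purely illustrative sketch of what propagating provenance records through operations such as merge and split could look like, the following Python fragment attaches a small lineage record to each data item. It is not the project's design: the names `Prov`, `Item`, `collect`, `merge`, `split`, and `sources`, and the device labels, are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Prov:
    """Hypothetical provenance record: one node in a lineage graph."""
    operation: str                    # "collect", "merge", or "split"
    origin: str = ""                  # device name, set only for "collect"
    parents: Tuple["Prov", ...] = ()  # records this one was derived from


@dataclass(frozen=True)
class Item:
    """A list of readings together with its provenance record."""
    value: List[float]
    prov: Prov


def collect(value: List[float], device: str) -> Item:
    """Create an item at its source device (a lineage leaf)."""
    return Item(list(value), Prov("collect", origin=device))


def merge(items: List[Item]) -> Item:
    """Concatenate items; the result's record points at every input."""
    values = [v for it in items for v in it.value]
    return Item(values, Prov("merge", parents=tuple(it.prov for it in items)))


def split(item: Item, index: int) -> Tuple[Item, Item]:
    """Split an item in two; each part records the original as its parent."""
    left = Item(item.value[:index], Prov("split", parents=(item.prov,)))
    right = Item(item.value[index:], Prov("split", parents=(item.prov,)))
    return left, right


def sources(prov: Prov) -> Set[str]:
    """Walk the lineage graph back to the devices that produced the data."""
    if prov.operation == "collect":
        return {prov.origin}
    found: Set[str] = set()
    for parent in prov.parents:
        found |= sources(parent)
    return found


a = collect([1.0, 2.0], "sensor-a")
b = collect([3.0], "sensor-b")
merged = merge([a, b])
left, right = split(merged, 1)
print(sorted(sources(left.prov)))
```

Note that this sketch tracks lineage per item rather than per element, so after a split each part is attributed to every device that contributed to the merged item. A fine-grain scheme of the kind the project describes would presumably need to narrow that attribution to the elements actually carried forward.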
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.