Collaborative data analysis has become both a necessity and a trend in the era of big data. In such collaborative environments, intellectual property protection mechanisms are critical to maintaining and encouraging research partnerships. Such mechanisms should protect not only data sources and data analysis algorithms but also data provenance, i.e., data processing history. For example, participating parties may need to secure access to, and sharing of, various kinds of data products (source, intermediate, and final), processing steps, and their interdependencies. However, existing mechanisms do not provide such fine-grained protection for the provenance of multi-step data analytics procedures (workflows). To address this challenge, this project aims to explore novel mechanisms for securing access to, and querying of, collaborative scientific workflow provenance.
The technical goal of this project is to understand in depth the feasibility of dataflow provenance-oriented access and querying mechanisms. This high-risk, high-reward work will produce the following two outcomes: (1) a multi-level, fine-grained, secure access and querying mechanism for provenance collection, including sensitive data as well as sensitive dependencies among data, tasks, and users, and (2) automated analysis algorithms to ensure that provenance access and querying policies conform to desirable constraints on evolving dependencies. The resulting techniques and tool will be evaluated in the genomic data analysis domain to demonstrate their usability and significance in the context of collaborative data analytics. The techniques will then be integrated into the NSF-sponsored collaborative scientific workflow tool to support secure data analytics collaboration.