This award supports the NSF Workshop: Data Curation: Ensuring Quality and Access to Enable New Science, to be held in September 2012 in Arlington, VA. The value of data to the global economy has been well documented and has spawned calls for training professionals who practice data curation and stewardship, data analytics, and "big data" management. It is evident that poor data is worse than no data because it wastes time, leads to poor science and decisions, and diminishes trust in the entire data enterprise. Data curation demands tools and techniques at each phase of the data life cycle that lead to effective and efficient data services that people trust. This workshop brings together leading researchers in data curation to establish a research agenda to guide development of these tools and techniques.
This workshop will have impact on the emerging data curation research and development community by defining directions for tools and techniques that support selection, metadata annotation, storage, access, use and reuse, and preservation of scientific and scholarly data. Such tools and techniques will make science and scholarship more effective and may be adapted to personal data management applications such as personal health or educational records. The workshop web site (http://datacuration.web.unc.edu/) will be used to disseminate further information, including the resulting workshop report, which will provide a roadmap for future data curation research and follow-up activities.
Science is built on observations. If our observational data is bad, we cannot trust the results that come from these observations. Data quality is an assertion about data properties, typically assumed within a context defined by a collection that holds the data. The assertion is made by the creator of the data. The collection context includes both metadata that describe provenance and representation information, and procedures that are able to parse and manipulate the data. However, from the perspective of users, data quality is defined by the data properties that are required for use within their scientific research. Users regard data as high quality when it can be shown to comply with their research requirements. Digital data can accumulate rich contextual and derivative data as it is collected, analyzed, used, and reused, and planning for the management of this history requires new kinds of tools, techniques, standards, workflows, and attitudes. As science and industry recognize the need for digital curation, scientists and information professionals recognize that access to and use of data depend on trust in the accuracy and veracity of data. In all data sets, trust and reuse depend on accessible context and metadata that make explicit the provenance, precision, and other traces of the datum and data life cycle. Poor data quality can be worse than missing data because it can waste resources and lead to faulty ideas and solutions, or at minimum challenge trust in the results and implications drawn from the data. Improvement in data quality can thus have significant benefits.
The National Science Foundation sponsored a workshop on September 10 and 11, 2012, in Arlington, Virginia, on "Curating for Quality: Ensuring Data Quality to Enable New Science." Individuals from government, academic, and industry settings gathered to discuss issues, strategies, and priorities for ensuring quality in collections of data.
This workshop aimed to define data quality research issues and potential solutions. The workshop objectives were organized into four clusters: (1) data quality criteria and contexts, (2) human and institutional factors, (3) tools for effective and painless curation, and (4) metrics for data quality. In addition to the contributed papers and breakout discussions, the workshop also yielded insights on several high-level themes. First, there are many perspectives on quality: quality assessment will depend on whether the agent making the assessment is a data curator, curation professional, or end user (including algorithms); quality can be assessed based on technical, logical, semantic, or cultural criteria and issues; and quality can be assessed at different granularities that include data item, data set, data collection, or disciplinary repository. This implies that assessments of quality must carefully specify the underlying assumptions and conditions under which the assessment was made. Second, there is movement toward more nuanced models of data control and curation, such as maturity levels (matrix models) that consider levels of stability and quality across different criteria and perspectives.
The workshop identified several key challenges: (1) selection strategies: how to determine what is most valuable to preserve; (2) how much and which context to include: how to ensure that data is interpretable and usable in the future, and what metadata to include; (3) tools and techniques to support painless curation: creating and sharing tools and techniques that apply across disciplines; and (4) cost and accountability models: how to balance selection and context decisions with cost constraints.