Most of the scientific advances of the last 400 years have come not merely from researchers working by themselves, but from a community of scholars cooperating and competing in pursuit of shared goals. Scholarly citation is a critical component of this community: it turns isolated works into a network of scholarship that can be navigated and mined. For centuries, the outcomes of such scholarly endeavors were written publications. With the coming of the digital age, new forms of scholarly output, such as data collections and digital publications, have become commonplace. Unfortunately, the practices of citation and attribution that have been the mainstay of written publications are insufficient for this new digital world. A citation for digital data needs to do more than reference the location of the item; it needs to describe what the data is, where it came from, and how it was produced. This research will yield new techniques, tools, and demonstrations of an extended citation service that uses data provenance, a formal record of how an object came to be in its current form. This extended citation service will facilitate activities such as research reproduction and attribution.
The provenance-enabled data citation system developed in this work will be both embedded in an existing data platform (specifically, Dataverse) and available as a standalone service. The system addresses the following specific data citation challenges. It directly includes executable transformations for a limited but important set of tools: R and SQL. For other tools, it provides a standardized documentation capability for describing transformations. The system is sufficiently flexible to serve either as part of a publication workflow, where data is part of a more conventional publication, or in support of a standalone publication. It also provides data summaries.
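As a purely illustrative sketch, and not the project's actual design or the Dataverse API, a provenance-extended citation of the kind described above might be modeled as a record that carries what the data is, where it came from, and how it was produced. All class names, field names, and example values below (including the placeholder locator) are hypothetical assumptions introduced only for illustration. Steps performed in R or SQL keep their executable source; steps from any other tool keep only a documented description.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Tools whose transformations are stored as executable code (per the text: R and SQL).
EXECUTABLE_TOOLS = {"R", "SQL"}


@dataclass
class Transformation:
    """One step in a dataset's history (hypothetical structure)."""
    tool: str                    # e.g. "R", "SQL", or any other tool name
    description: str             # standardized, human-readable documentation
    code: Optional[str] = None   # executable source, expected only for R and SQL

    @property
    def is_executable(self) -> bool:
        # A step can be re-run only if its tool is supported and code was captured.
        return self.tool in EXECUTABLE_TOOLS and self.code is not None


@dataclass
class ProvenanceCitation:
    """A citation that describes the data, its origin, and how it was produced."""
    title: str                   # what the data is
    source: str                  # where it came from
    locator: str                 # where the item can be found
    transformations: List[Transformation] = field(default_factory=list)
    summary: str = ""            # short data summary

    def render(self) -> str:
        """Return a plain-text citation with one line per provenance step."""
        lines = [f"{self.title}. Source: {self.source}. Location: {self.locator}."]
        for i, step in enumerate(self.transformations, start=1):
            kind = "executable" if step.is_executable else "documented"
            lines.append(f"  Step {i} [{step.tool}, {kind}]: {step.description}")
        if self.summary:
            lines.append(f"  Summary: {self.summary}")
        return "\n".join(lines)


# Placeholder values only; none of these refer to a real dataset.
example = ProvenanceCitation(
    title="Example survey extract",
    source="Hypothetical survey archive",
    locator="doi:10.0000/EXAMPLE",
    transformations=[
        Transformation("SQL", "Select complete responses",
                       code="SELECT * FROM responses WHERE complete = 1;"),
        Transformation("Spreadsheet", "Manually recoded the region column"),
    ],
    summary="2 provenance steps recorded",
)

print(example.render())
```

Keeping the executable and documented steps in one list preserves their order, so a reader can see which parts of the history could be re-run directly and which rely on a written description.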