The volume of data produced by computational modeling and analysis places increasing demands on researchers, IT staff, and IT infrastructure to store, manage, and move data. At the same time, funding agencies are increasing and enforcing requirements for the management of data produced by grant-funded projects. Together, these pressures leave institutions without established funding models or best practices for managing data at this scale, while researchers and IT professionals often lack the data curation expertise needed to select data and create metadata that ensure the long-term preservation and discoverability of important data. There is a strong need to bring these communities together with the library and archive community to consider data lifecycle management and to develop long-term funding and data curation strategies that will help institutions meet these growing needs.
The objective of this workshop is to bring together researchers, campus information technology (IT) leaders, and library and archive specialists to discuss data lifecycle management as it relates specifically to computational science and engineering research data. The discussion will produce a common understanding of best practices and funding models for selecting, storing, describing, and preserving this digital data. The workshop will also cultivate partnerships among these communities to foster continued progress in the preservation and sharing of research data.
The recommendations and practices developed at this workshop will enable more effective preservation and sharing of the enormous volume of data sets produced by computational scientists and engineers. The workshop will directly strengthen the intellectual capabilities of our higher education institutions by initiating a sustained dialogue on the lifecycle management of research data.
Faculty and staff who attend the workshop will return to their institutions better prepared to lead efforts to develop and improve research data lifecycle management practices. As these practices are established on campuses across the country, data will become more available to all institutions, including those in economically disadvantaged areas. Broader availability of data will benefit both research and education, reaching students and researchers alike.
The following recommendations are the outcomes of the discussions at the Research Data Lifecycle Management workshop, July 18–20, 2011.

Funding and Operation of Research Data Lifecycle Management
• The National Science Board or similar entities should be engaged to interact with funding agencies to determine whether data preservation activities should be covered by the indirect cost pool defined for academic institutions.
• Research community standards, similar to records retention standards in business, need to be clarified so that future research data federations or partnerships can be established.
• An initiative should be undertaken to provide a common (interdisciplinary) definition of data.
• An initiative should be undertaken to outline the phases of the data lifecycle.
• An initiative should be undertaken to develop a shared taxonomy, since terms such as "preservation" may mean different things to different communities.
• A research project should examine the existing collection of operational models for data lifecycle management.

Partnering Researchers, IT Staff, Librarians, and Archivists
• Complete a survey of organizational models and best practices for communication and interaction among researchers, IT staff, librarians, and archivists.

Assessment and Selection of Research Data
• Over the long term, create workflows for data collection that include metadata creation and data selection, making it easy for researchers to undertake these actions.
• Add data management to the academic research methodology curriculum, perhaps as an enhancement to existing research compliance training.
• Demonstrate to researchers how standardized archiving and retention policies will make their lives easier.

Policy
• Universities should develop or clarify policies about data management, including recommendations on where data should be deposited.
• Universities and related institutions should act in concert to develop policies about data ownership and responsibility, in consultation with funding agencies.
• Organize a workshop for senior research officers (vice presidents or deans for research and similar positions) and senior academic officers (provosts and similar positions) to discuss data lifecycle management, in order to elevate the visibility and importance of this topic among senior university administrators.
• Create a catalog of issues such as data ownership and data restrictions for all disciplines in a quick-guide format.
• Organize a workshop for leaders of discipline communities to develop a common framework that could then be customized or extended to meet each community's needs. Provide a list of issues and possible solutions to communities to meet their standards and needs.

Standards for Provenance, Metadata, and Discoverability
• Create a framework to share and receive data and metadata across disciplines, with confidence in the quality of both. When doing so, describe the framework in terms of principles rather than specific technologies (for example, describe trees and relationships rather than specifying RDF).
• Design provenance metadata that cuts across different areas.
• Create strategies for capturing metadata at various points in the data lifecycle, in an automated way if possible. For instance, sensors could be designed to create better metadata for sensor data. A frequently used approach is to create a data model with project members, build use cases, and identify milestones. Observing data practices is very valuable for learning about an area and translating practice in the field into tools.
• Follow the approach demonstrated by the genome sequencing community: develop community metadata standards, and have agencies fund only those researchers who follow them.
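To make the recommendation on automated metadata capture concrete, the following is a minimal sketch of what capturing metadata at the point of data ingest might look like. The sidecar-file approach and the field names (checksum, size, timestamps, creator, instrument) are illustrative assumptions, not a community standard; a real workflow would map such fields onto a discipline-specific metadata schema.

```python
import hashlib
import json
import os
from datetime import datetime, timezone

def capture_metadata(data_path, creator, instrument=None):
    """Write a JSON metadata sidecar for a data file at ingest time.

    The fields recorded here are illustrative only; a production
    workflow would follow a discipline-specific metadata standard.
    """
    # Compute a fixity checksum in 1 MB chunks so large files
    # are never read into memory all at once.
    sha256 = hashlib.sha256()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha256.update(chunk)

    record = {
        "file": os.path.basename(data_path),
        "size_bytes": os.path.getsize(data_path),
        "sha256": sha256.hexdigest(),
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "creator": creator,
        "instrument": instrument,  # e.g. a sensor ID, when applicable
    }

    # Store the metadata alongside the data file as a sidecar.
    sidecar = data_path + ".metadata.json"
    with open(sidecar, "w") as f:
        json.dump(record, f, indent=2)
    return record
```

Embedding a step like this in the data collection workflow means provenance and fixity information is captured once, automatically, rather than reconstructed later by hand.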
Secure Research Data
• Enquire about a national working group to guide compliance with various federal standards by research computing environments.
• Catalog solutions for remote access to restricted data.
• Catalog solutions for clinical translational study data (several university medical hospital/research groups have Clinical and Translational Science Award (CTSA) programs).

Partnering Funding Agencies, Research Institutions, and Communities, Combined with Industrial and Corporate Partnerships
• Arrange for a trusted party to conduct a survey of vendor cost models for the data lifecycle and preservation, for comparison with storage options available through academic institutions and federal agencies.
• Provide suggestions and insight to NSF and other funding agency program directors on how to leverage existing regional, discipline-specific archives, and on how smaller schools with fewer resources can use larger institutions' repositories.
• Engage funding agencies and vendors in discussions about potential networking and regional repository solutions for moving large data sets.