This is an EAGER proposal to support two graduate students in research that responds to an immediate need: answering critical questions about transferring, managing, and organizing large digital datasets. The project capitalizes on a very important event: the Sloan Digital Sky Survey (SDSS) scientific dataset is being transferred from one repository to two others, from a national laboratory to a university library and also to a university-based group of astronomers, and thus from one kind of workforce to two others. This is a pivotal moment at which to study the organization of digital knowledge in astronomy and how that knowledge will be developed.

The transfer of the database among the three workforces, although highly planned, will face multiple strategic difficulties that will require members of all workforces to develop a new form of knowledge at the interface of their different practices. Those who become adept at this interface of knowledge and data transfer will possess knowledge crucial to the designers of data repositories and to those who make use of them. Understanding changes and organizational structures, identifying differences, and managing cultural and technical issues can be extremely informative to proposed approaches in the cross-disciplinary, cross-workforce science of the future.

Project Report

The project "Knowledge and Data Transfer: the Formation of a New Workforce" studied the social and technical processes associated with the transfer of the Sloan Digital Sky Survey (SDSS), a major large-scale scientific dataset, from one repository and one kind of workforce to another. The SDSS is a groundbreaking astronomical survey covering over a quarter of the night sky with high-quality optical and spectroscopic imaging. As an open data project, it became a major resource for astronomy research, education, and use by the general public. The SDSS is known for the quality of its data, which are carefully calibrated and curated prior to each data release. Community concern for continuing access to these data led to an agreement between the SDSS project and a major U.S. university digital library to transfer the SDSS data for long-term stewardship.

Moving a very large scientific data resource (approximately 130 terabytes) from a functioning interactive system located at a major research center to a dark archive at a university research library had never been attempted. This project studied how the data transfer process was designed, conducted, and evaluated, and the expertise involved at each stage. We learned that the curation of scientific data in digital form involves a diverse array of expertise and workforce roles. Transfer activities were far more complex and labor-intensive than anticipated by the teams involved. We focused most closely on the work of the recipient team, which had considerable experience with digital libraries, although not with astronomy data. Among the challenges they faced were interoperability problems between the former and future database architectures, policy differences between the partners, scoping "the data" to be transferred, and the balance of technical and domain expertise required. Differences in technology and policy led to moving millions of small files over a secure network, all of which had to be verified to archival standards.

New tools were identified and available tools were adapted. New data management practices were invented to support the process. Collaborations brought together experts in digital library organization, archival practice, data practice, astronomy, computer science, software engineering, mathematics, and statistics. No single individual had the array of expertise necessary, either at the beginning or in later stages of the process. Over the course of this five-year data transfer project, staff members gained new kinds of expertise. They also came to appreciate the many perspectives on "the data" in the Sloan Digital Sky Survey, as viewed by astronomers, libraries, technologists, and others. As the SDSS was moving from an active system to a dark archive (although other copies of the dataset exist elsewhere), the functionality of the dataset was also changing considerably in the transfer process.

Our findings show that the scale of research data becomes a problem in itself. Tasks that are mundane on smaller datasets become surprisingly difficult problems with datasets the size of the SDSS. Collecting, calibrating, organizing, analyzing, accessing, curating, preserving, and transferring big research datasets require multiple kinds of expertise that must be coordinated effectively. Some curation activities depended upon knowledge of storage mechanisms, independent of the domain of the data, while others required substantial knowledge of astronomy. Data curation, open data, data management plans, and most other aspects of data stewardship will depend upon the availability of a workforce with the appropriate skills and expertise. The array of skills and expertise necessary for managing scientific data, large and small, is not yet well understood. We found that multiple workforces are involved and that none of them has the full complement of expertise necessary.

Our findings will inform the development of the workforce for managing and curating research data in scientific, library, and archival settings. They also offer insights for those involved in the stewardship of large datasets in astronomy and in other domains.

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 1145888
Program Officer: Robert Chadduck
Budget Start: 2011-09-01
Budget End: 2013-08-31
Fiscal Year: 2011
Total Cost: $90,000

Name: University of California Los Angeles
City: Los Angeles
State: CA
Country: United States
Zip Code: 90095