Scientific specimens, typically found in museum collections, serve as the anchor for an expanding array of information that grows and changes over time. This information, about specimens and the species that the specimens represent, is often scattered geographically across institutions and across independent computer systems, making it difficult to access or synthesize. The goal of this project is to develop a two-way system of linking and tracking scientific specimens and specimen-related data across biological collections, and to make this system widely available to the scientific community and the public. This system would employ globally unique identifiers, or GUIDs, to tag and update information associated with specimens, allowing communication between end users and collections. This project will improve data quality and quantity for non-scientists and scientists, and will actively engage user communities through training workshops, summer student internships, and community BioBlitz enhancements.

The ability to integrate specimen data and associated information across biological collections will enable critical studies related to systematics, biogeography, and changing species distributions. These in turn have implications for climate change, changing land use, and other questions key to understanding the past, placing changes in an historical context, and predicting the future of species and environments. This project is part of a 10-year effort to digitize and mobilize the scientific information associated with biological specimens held in U.S. research collections. The images and digitized data from this project will be integrated into the online national resource as outlined in the community strategic plan available at http://digbiocol.files.wordpress.com/2010/05/digistratplanfinaldraft.pdf.

Project Report

Natural history collections provide irreplaceable legacy information about our biosphere in an era of rapid change. At the heart of these collections are specimens, which have been collected in nature and brought into museums to be held in perpetuity. These specimens, and the associated data collected in the field and during downstream accessioning and curatorial processes, continue to provide new value long after they are first accessioned. For example, specimens yield new derivatives, such as tissues used in genetic and genomic analyses, digital records that can be mapped en masse to reconstruct current and past species distributions, and images and other digital content that can help quantify the shape and size of biodiversity. As biocollections continue to be rapidly digitized and mobilized, a critical challenge is ensuring that the digital content about specimens remains connected. For example, an analog label associated with a specimen may be digitized to capture its taxonomy, the date and location of the collecting event, and other associated content. This record might be further processed to conform to community data standards and published online as part of ongoing mobilization efforts. At a later date, that same specimen might be subsampled for tissue, and the tissue processed for genomic DNA and ultimately used in genetic or genomic analyses. All of this content needs to be made available in repositories to support initial use and re-use and, just as importantly, all of it needs to point back easily and directly to the same original specimen. The BiSciCol project is a multi-institutional collaboration developed to tackle the challenge of "tagging and tracking" specimens and their many derivatives, in the same way that packages shipped via the US mail system have tracking numbers.
The challenge BiSciCol faced is that biocollections practice has been built around local identifiers used within institutions, rather than Internet-scale globally unique identifiers. Past practices had also led to specimen records that once shared a common history becoming disconnected when brought into separate institutions. In addition, practices across the community were inconsistent (at best) when reporting the specimen identifiers associated with derivative digital data such as sequences. Finally, current systems are built around data moving to repositories such as iDigBio, GenBank, or the Barcode of Life Data Systems in flat-file formats that impede the ability to link data together in new ways. BiSciCol produced a suite of next-step solutions to these deep structural and functional problems in the way biocollections data propagate out from the originally collected objects. These solutions ranged across a broad space and included the following components:
1) A collaboration with the California Digital Library (CDL) to produce a rational solution to the problem of uniquely identifying specimen records. Unlike images or PDFs, specimen records that point to a physical object are often bundled together as datasets, with each row pointing back to a specimen. There are also likely over a billion such records, which strains the capacity of commonly used systems such as digital object identifiers (DOIs). The solution with CDL focuses on identifying datasets, while also allowing individual records to be identified by "passing through" the record-level ID when the dataset identifier is resolved. The end result is that any record-level ID can be "resolved" by a user to find content related to that ID.
2) The BiSciCol team also studied current practices and isolated the problems in how identifiers are currently being used. In particular, attempts to build systems that keep the linkages between specimen data and sequence data intact have so far not succeeded, because of a lack of strong curatorial practice in entering and maintaining those identifiers in those systems. This work led to a publication in the open access journal PLOS ONE.
3) The BiSciCol team developed software called "The Triplifier" and "The Triplifier Simplifier" that takes input biodiversity data in any format, typically spreadsheets, text files, and databases, and creates ready-made outputs in RDF (the Resource Description Framework), a format better suited to linking data together. The "Simplifier" part of the toolkit takes the most common interchange format for biocollection data records, Darwin Core Archives, and produces more properly linked data based on the implicit interrelationships in this interchange data. This led to a publication about the tool in the open access journal BMC Bioinformatics.
4) The BiSciCol team has extended this work from implicit connections within and among datasets to modeling those connections more explicitly using burgeoning ontology approaches. BiSciCol team members were part of a publication on this new approach in PLOS ONE.
Although the BiSciCol grant has ended, it has led to new work that is ongoing, and it has had its intended impacts on the biocollections and biodiversity informatics community.
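
The dataset-level identifier with record-level "pass-through" (component 1) and the conversion of tabular rows into RDF (component 3) can be illustrated with a rough sketch. This is not the CDL or Triplifier implementation. The identifier `ark:/99999/fk4demo`, the target URL, the `DATASET_TARGETS` registry, the predicate namespace under `http://example.org/terms/`, and the function names are all hypothetical, chosen only to show the idea.

```python
from urllib.parse import quote

# Hypothetical registry: one registered identifier per dataset, mapped to the
# URL where that dataset is served. Both values are made up for illustration.
DATASET_TARGETS = {
    "ark:/99999/fk4demo": "http://example.org/collections/herps",
}


def resolve(identifier):
    """Resolve a dataset ID, or a record ID formed as dataset ID + suffix.

    Only the dataset prefix is registered; any trailing suffix is passed
    through unchanged onto the dataset's target URL.
    """
    # Check longer prefixes first so a more specific registration wins.
    for prefix in sorted(DATASET_TARGETS, key=len, reverse=True):
        if identifier.startswith(prefix):
            suffix = identifier[len(prefix):]
            return DATASET_TARGETS[prefix] + suffix
    raise KeyError(f"no registered dataset for {identifier!r}")


def escape_literal(value):
    """Escape a string for use as an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def triplify(rows, dataset_id, id_column, base="http://example.org/terms/"):
    """Turn spreadsheet-style rows (dicts of strings) into N-Triples lines.

    Each row becomes one subject, identified by the resolved record URL;
    each non-empty column becomes one predicate/literal pair.
    """
    lines = []
    for row in rows:
        subject = resolve(f"{dataset_id}/{row[id_column]}")
        for column, value in row.items():
            if column == id_column or value == "":
                continue  # the ID is already in the subject; skip blanks
            predicate = base + quote(column, safe="")
            lines.append(f'<{subject}> <{predicate}> "{escape_literal(value)}" .')
    return lines


if __name__ == "__main__":
    rows = [{"catalogNumber": "12345", "scientificName": "Rana draytonii"}]
    for line in triplify(rows, "ark:/99999/fk4demo", "catalogNumber"):
        print(line)
```

Registering only the dataset prefix keeps the number of registered identifiers small even when the records themselves number in the billions, which is the scaling concern raised above; the cost is that the resolver cannot tell whether a given suffix actually names an existing record.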

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Application #: 0956350
Program Officer: Anne Maglia
Budget Start: 2010-10-01
Budget End: 2014-09-30
Fiscal Year: 2009
Total Cost: $210,448
Name: University of Colorado at Boulder
City: Boulder
State: CO
Country: United States
Zip Code: 80309