Scientific specimens, typically found in museum collections, serve as the anchor for an expanding array of information that grows and changes over time. This information, about specimens and the species that the specimens represent, is often scattered geographically across institutions and across independent computer systems, making it difficult to access or synthesize. The goal of this project is to develop a two-way system of linking and tracking scientific specimens and specimen-related data across biological collections, and to make this system widely available to the scientific community and the public. This system would employ globally unique identifiers, or GUIDs, to tag and update information associated with specimens, allowing communication between end users and collections. This project will improve data quality and quantity for non-scientists and scientists, and will actively engage user communities through training workshops, summer student internships, and community BioBlitz enhancements.

The ability to integrate specimen data and associated information across biological collections will enable critical studies related to systematics, biogeography, and changing species distributions. These in turn have implications for climate change, changing land use, and other questions key to understanding the past, placing changes in an historical context, and predicting the future of species and environments. This project is part of a 10-year effort to digitize and mobilize the scientific information associated with biological specimens held in U.S. research collections. The images and digitized data from this project will be integrated into the online national resource as outlined in the community strategic plan available at http://digbiocol.files.wordpress.com/2010/05/digistratplanfinaldraft.pdf.

Project Report

Outcomes

BiSciCol

At the University of Arizona, the majority of the work on the Biological Sciences Collection Tracking project has gone into developing methods to facilitate and evaluate the creation of structured database records in extended Darwin Core (DwC) from images of museum specimen labels, using text produced by Optical Character Recognition (OCR). This work was coordinated through the Augmented Optical Character Recognition (AOCR) Working Group at iDigBio at the University of Florida.

The team produced a reference collection of images and associated OCR output from three collections: lichens, herbaceous plants, and entomology. The team then created a set of human transcriptions of the labels to represent a gold standard for OCR. This reference set was used to evaluate the ABBYY, Tesseract, and OCRopus OCR engines. Results were presented at the iConference. In general, ABBYY was superior to unmodified versions of Tesseract and OCRopus.

The AOCR team then created a set of CSV files associated with those label images. The columns of these records indicate the Darwin Core fields for the different components of the labels, along with an extended set of fields not normally included in DwC. The reference collections can be found on GitHub at https://github.com/idigbio-aocr/label-data.

The Arizona team used a set of machine learning algorithms to predict the classification of subelements of the museum labels in extended Darwin Core. These algorithms, which include Hidden Markov Models (HMMs), require training sets in an ordered XML format. A set of programs was developed that uses heuristics and pattern matching to take the unordered CSV files from the step above and produce ordered XML. These programs can be found at https://github.com/BryanHeidorn/LABELX. Java programs to train and apply the HMMs can also be found in this GitHub repository. The output from this parsing process includes both XML and CSV serializations. The XML includes EZID GUIDs for crosslinking to other resources associated with the specimens.

The AOCR working group also developed a set of algorithms to evaluate the performance of OCR engines, as well as of the parsers that format the OCR text into extended Darwin Core. These tools, available at https://github.com/idigbio-aocr/scoring, can be used by later projects to evaluate the performance of new parsers as they become available.
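To illustrate the kind of evaluation described above, the following is a minimal sketch of two common measures: a character error rate for raw OCR text against a gold-standard transcription, and an exact-match accuracy for parsed fields against gold-standard fields. It is not taken from the AOCR scoring repository; the function names, the choice of Levenshtein distance, and the sample values are assumptions made purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, gold_text: str) -> float:
    """Edit distance normalised by the length of the gold transcription."""
    if not gold_text:
        raise ValueError("gold_text must not be empty")
    return levenshtein(ocr_text, gold_text) / len(gold_text)


def field_accuracy(parsed: dict, gold: dict) -> float:
    """Fraction of gold fields whose parsed value matches exactly,
    after collapsing runs of whitespace (an assumed normalisation)."""
    if not gold:
        raise ValueError("gold must contain at least one field")

    def norm(value: str) -> str:
        return " ".join(value.split())

    hits = sum(1 for key, value in gold.items()
               if norm(parsed.get(key, "")) == norm(value))
    return hits / len(gold)


if __name__ == "__main__":
    # Hypothetical label fields, keyed by Darwin Core term names.
    gold = {"scientificName": "Cladonia rangiferina", "recordedBy": "J. Smith"}
    parsed = {"scientificName": "Cladonia  rangiferina", "recordedBy": "J. Smyth"}
    print(character_error_rate("abcf", "abcd"))
    print(field_accuracy(parsed, gold))
```

A character-level measure of this kind is suited to comparing OCR engines on raw text, while the field-level measure is closer to what matters for a parser, where a value is only useful if it lands in the correct Darwin Core field.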

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Application #: 0956271
Program Officer: Anne Maglia
Project Start:
Project End:
Budget Start: 2010-10-01
Budget End: 2014-09-30
Support Year:
Fiscal Year: 2009
Total Cost: $229,067
Indirect Cost:
Name: University of Arizona
Department:
Type:
DUNS #:
City: Tucson
State: AZ
Country: United States
Zip Code: 85721