The Global Biodiversity Information Facility (GBIF) is awarded a grant to cover the United States' participation in the single largest biodiversity informatics initiative worldwide. GBIF provides access to over 174 million primary biodiversity records and over 1 million name records, active development of informatics tools and software, and interaction with a global network of peers in biodiversity science and biodiversity informatics. GBIF thus provides a unique combination of rich biodiversity data and expertise in biodiversity informatics. GBIF offers a portal to the largest storehouse of primary, research-class biodiversity data, accessed from hundreds of institutions globally in a distributed fashion. GBIF-developed or GBIF-funded informatics tools are increasingly used in diverse informatics applications, ranging from dynamic "encyclopedia of life" mashups to utilities for large-scale downloads of primary research data. This biodiversity collaboratory thus represents a catalyzing force for developments across an emerging field of informatics. The GBIF initiative offers a rich arena for global collaboration in biodiversity science. It includes dimensions of data repatriation to countries of origin, broad enabling of biodiversity conservation efforts, prioritizing areas for public health remediation, and other dimensions of policy-making. The GBIF effort has trained hundreds of scientists, both in the US and globally, in aspects of the emerging field of biodiversity informatics. As such, this initiative has broad global implications and positive impacts.
Under this award, GBIF will seek to transform the working prototype into a fully functional information facility. Under the reorganization of GBIF into two thematic areas, Informatics and Participation, the GBIF Work Programme for 2009-2010 (http://www2.gbif.org/WP2009-10.pdf) lays out a bold suite of advances toward a comprehensive global biodiversity information infrastructure that will support both science and policy decision making, and that will ensure that GBIF Informatics prioritizes development in accordance with the needs expressed via the Participation thematic area on behalf of participants. The proposed advances, based on concrete stakeholder demands and years of experience with the informatics infrastructure so far, will bring the GBIF information resources from prototype to full operation as "the primary World Wide Web source for all data and information about biodiversity," as laid out in the 2007-2011 GBIF Strategic Plan.
The award was made to the Global Biodiversity Information Facility (GBIF) to explore the technical and social constraints related to the sharing of biodiversity information and data. GBIF is a multilateral agreement among participating countries to develop shared infrastructure that supports the sharing of, and access to, information and data about the species inhabiting our planet. The initial data focus of the GBIF network has been on the mobilization of, and access to, what is known as "primary" biodiversity data. That is:
1. Species observations, such as those made by scientists and, increasingly, by members of the public, and made accessible through a growing number of online observation networks.
2. Natural history collections, represented by an estimated 3-4 billion preserved and labeled plant, animal and fossil specimens housed in the world's natural history museums, herbaria and other collections.
GBIF serves as a global index to over 10,000 databases that serve the data linked to these sorts of observations and specimens. The data have a wide range of uses, from verifying the historical occurrence of species in places where they may no longer occur, to identifying biodiverse habitats as candidates for preservation, to modeling food security under different climate change scenarios. The first generation of the GBIF network, prior to the award, drew upon technologies that were complicated to set up and manage. They also did not provide a very thorough set of details regarding the origin of the data and, importantly, the individuals and institutions responsible for putting the data online. This award allowed us to test the hypothesis that a lower technical threshold and an enriched provenance profile would increase both the number and quality of data shared through the GBIF network. The work effort centered on a number of major areas of technical development.
GBIF led an international effort to expand and refine the core data standards used as the basis for sharing biodiversity data. This had a number of specific outcomes. The Darwin Core standard terms were extended and refined so that a wider array of biodiversity data products could be shared. GBIF led the development of a significantly simplified data exchange file format, called the Darwin Core Archive. GBIF eliminated the need for cumbersome data access protocols for its indexing purposes, dramatically reducing indexing time. GBIF identified and extended a dataset descriptive standard, the Ecological Metadata Language (EML), to provide a richer mechanism for describing datasets and their contributors. GBIF developed an Integrated Publishing Toolkit (IPT), a well-supported and vastly improved web application for sharing biodiversity data through the GBIF network. Earlier generation publishing tools were poorly supported, complicated to configure, and difficult to upgrade. The IPT has a dedicated open-source project team, a training program, and multilingual documentation. Its use has led to widespread adoption of the Darwin Core Archive data standard, which the IPT supports, with resulting performance benefits in the sharing and indexing of data. Lastly, GBIF overhauled the central data indexing portal. This included the registration process, which had not effectively represented the sometimes complex relationships between the organizations, institutions and individuals that work together to collect, curate, and transform biodiversity data into an online resource. The updated register supports such complexity, addressing the concerns of these contributors that they were not being effectively credited for their role. In addition, the indexing and data validation processes were overhauled so they could be undertaken in a parallelized manner.
This reduces a complete turnover of the network data from all servers from about 1 month to less than 48 hours, and it means the ultimate limit on processing is now set by the number of processors the system employs, as opposed to the earlier limits imposed by serialized processing.
Results
The results of this work were:
1. A significant increase in both the number of databases and the number of data records published through the GBIF network (from 120 million records from 7,000 databases at the start of the project to 397 million records from 10,085 databases).
2. A reduction in overall indexing latency from 35 days to 1.5 days (a 96% decrease in latency).
3. An improvement in data quality along a number of different measures. This includes:
A. Percentage of scientific names not matched to authority files: from 37% to less than 3%.
B. Inclusion of Exclusive Economic Zone data to better align offshore data records to their countries of origin.
C. Improvement of duplicate data detection processes that resulted in more than a doubling in recall.
D. A decrease in geospatial errors associated with transposition of data elements.
Summary
Lowering the technical threshold, increasing the visibility and recognition of contributors, and decreasing the overall latency with which the network responds to published data have significantly increased the capacity and quality of biodiversity data shared worldwide through the GBIF network.
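The headline figures in the results can be checked with simple arithmetic. The sketch below is only an illustration: the helper name `percent_change` and the variable names are assumptions introduced here, and the inputs are the before and after values quoted in the results.

```python
def percent_change(before: float, after: float) -> float:
    """Return the signed percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Before/after values as quoted in the results (names are illustrative only).
latency_days = (35.0, 1.5)        # overall indexing latency, in days
records_millions = (120.0, 397.0)  # published data records, in millions
databases = (7_000, 10_085)        # published databases

latency_change = percent_change(*latency_days)
records_change = percent_change(*records_millions)
databases_change = percent_change(*databases)

print(f"Indexing latency change: {latency_change:.1f}%")
print(f"Record count change:     {records_change:.1f}%")
print(f"Database count change:   {databases_change:.1f}%")
```

The latency change works out to a reduction of roughly 95.7%, which is consistent with the 96% decrease reported. By the same calculation, the record count more than tripled while the number of databases grew by a little under half.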