Sawflies, ants, bees, and wasps (Hymenoptera) comprise an extraordinarily diverse lineage of insects, with more than 115,000 described species and likely 1,000,000 species yet to be discovered. These insects serve critical roles as pollinators, parasitoids, and herbivores, and as models for research on social behavior, physiology, speciation genetics, and parasite-host interactions. The vast anatomical diversity exhibited by these organisms, coupled with a large body of disparate research and the eccentricities of investigators, has yielded numerous concurrent and only partially overlapping vocabularies that describe Hymenoptera anatomy. Centuries of Hymenoptera research, therefore, remain clouded by inconsistent terminology (for example, 'annellus' is used for two different head structures and for a part of the male genitalia).

This project will bring Hymenoptera researchers together to build a consensus structured vocabulary (the Hymenoptera Anatomy Ontology) that 1) enables discovery of research results from publications, 2) empowers taxonomists to efficiently describe and diagnose species, and 3) provides improved access to information for policy makers, farmers, land managers, and the general public. Tools will be developed that allow collaborators to virtually build ontologies for any group of species, while making these data useful to the research community via a Web-based anatomical atlas and application programming interface; all software will be open source (http://purl.oclc.org/NET/hymontology).

Three postdocs and three students will receive training in an emerging field: ontologies in evolutionary biology.

Project Report

Taxonomists are arguably the most active annotators of the natural world, collecting and publishing millions of phenotype data points annually through descriptions of tens of thousands of species. Nature’s incredible phenotypic diversity provides an unlimited source of potential models that help in understanding the biological processes underlying human diseases or that inspire advances in human technology. Unfortunately, phenotypes are currently published in natural language (free text), with neither a standard syntax nor a standard vocabulary, and thus are essentially hidden from researchers who need the data. For example, a branched arthropod hair could be described in a variety of ways, all with roughly the same meaning: hairs plumose vs. setae forked vs. sensilla ramose vs. cuticular processes pronged. For a researcher seeking to better understand the basics of branching morphogenesis (the biological processes that lead to the development of human circulatory, respiratory, or nervous systems, for example) or to develop new adhesive materials, the only way to harvest nature’s rich source of branching phenotypes is to read an impossibly large number of papers. Another option, which was the focus of this project, is to develop a way to represent phenotype data in a structured, standard form that facilitates computation, much as can already be done with DNA data. If phenotype data are collected and published in a semantically enhanced way, they become available for addressing questions beyond simply "what species is this?"

Intellectual Merit. The project funded by this award was geared primarily toward developing a new, semantic approach (that is, one meaningful to both computers and humans) to describing species and representing their phenotypes.
Phase 1 was to assemble a referable knowledge base of anatomical concepts for the insect order Hymenoptera (which includes sawflies, wasps, bees, and ants), chosen because of its rich diversity (>150,000 known species, with perhaps at least 1,000,000 awaiting description), its economic relevance (>$17 billion in natural control of insects, >$150 billion in pollinator services), and because it encompasses two important model organisms, the honey bee (Apis mellifera) and Nasonia species. The resulting knowledge base was an ontology, that is, a formal representation of concepts in a domain (in this case structured, standardized definitions of hymenopteran body parts, accompanied by annotated illustrations) and of the relationships between those concepts, like "is a" and "part of" (for example, antenna is_a appendage and antenna part_of the head). Phase 2 was to develop a standard syntax and semantic model for describing hymenopteran phenotypes that is understood by computers and humans alike. Several character description templates were developed using a common standard, Web Ontology Language (OWL), that allow for the incorporation of anatomical concepts and phenotype descriptors, like colors and shapes (drawn from a different ontology, developed for model organisms). Several issues concerning data longevity, relevance, and difficult-to-describe phenotypes had to be solved for this approach to be broadly applicable and sustainable. Datasets were then generated that demonstrated the utility of semantic, computable phenotypes by allowing researchers to ask questions of the data that required logic to answer. The development of these datasets also revealed that a few relatively simple changes in a typical taxonomist’s workflow could allow for the generation of semantic phenotype data as a normal product of revisionary taxonomy.
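The "is a" and "part of" relationships described above can be illustrated with a minimal Python sketch. This is not the project's OWL model or software: the class `TinyOntology` and its methods are hypothetical names, and the `flagellum` and `body` statements are added here only for illustration. Only the two antenna statements come from the text. The sketch shows why some questions need logic to answer: nothing states directly that the flagellum is part of the head, but it follows by chaining part_of statements.

```python
from collections import defaultdict


class TinyOntology:
    """Toy store of (subject, relation, object) statements. Hypothetical, for illustration."""

    def __init__(self):
        # relation name -> subject -> set of objects
        self._edges = defaultdict(lambda: defaultdict(set))

    def add(self, subject, relation, obj):
        self._edges[relation][subject].add(obj)

    def ancestors(self, concept, relation):
        """Everything reachable from `concept` by repeatedly following `relation`."""
        seen = set()
        stack = [concept]
        while stack:
            current = stack.pop()
            for parent in self._edges[relation].get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def is_part_of(self, part, whole):
        """True if `part` is directly or indirectly part_of `whole`."""
        return whole in self.ancestors(part, "part_of")


onto = TinyOntology()
onto.add("antenna", "is_a", "appendage")      # from the text
onto.add("antenna", "part_of", "head")        # from the text
onto.add("flagellum", "part_of", "antenna")   # illustrative assumption
onto.add("head", "part_of", "body")           # illustrative assumption

# Inferred, not asserted: flagellum -> antenna -> head
print(onto.is_part_of("flagellum", "head"))
```

A production ontology would rely on an OWL reasoner rather than a hand-written traversal; the sketch only conveys the kind of inference involved.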
Broader Impacts. The resulting knowledge base was made available as an online source of hymenopteran anatomy, and a Web-based tool was developed that allows anyone to extract anatomical concepts from free text. The extracted terms can then be matched to concepts in the ontology, and the software will generate an appendix (a glossary with links to definitions for each term). This tool is being used by an increasing number of hymenopterists, making their publications more accessible to non-specialists. Three graduate students, three postdocs, and several undergraduates were trained in biodiversity informatics, and there were several activities designed to educate professional scientists about the potential of semantic phenotype data. The project also resulted in several high-profile publications that highlight issues with the way phenotype data are typically generated and that describe the mechanisms now available to make phenotype data more broadly accessible. The project inspired numerous other arthropod research communities to develop ontologies (for example, people who study beetles are developing a similar resource, using the technology developed for this NSF project). In this way the HAO project catalyzed a new process in arthropod taxonomy: standardizing the way that phenotype data are represented.
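The term-extraction workflow described above (find anatomical terms in free text, match them to ontology concepts, and generate a glossary) might be sketched roughly as follows. This is a simplified illustration, not the actual Web-based tool: the vocabulary, the `EX:` identifiers, the definitions, the synonym table, and the function names `extract_terms` and `build_glossary` are all hypothetical placeholders.

```python
import re

# Hypothetical mini-vocabulary: preferred label -> (placeholder ID, illustrative definition)
VOCABULARY = {
    "antenna": ("EX:0001", "Paired sensory appendage of the head."),
    "head": ("EX:0002", "Anterior body region bearing the antennae and mouthparts."),
    "seta": ("EX:0003", "Hair-like cuticular process."),
}

# Illustrative plural forms and synonyms mapped to preferred labels
SYNONYMS = {"antennae": "antenna", "setae": "seta", "hair": "seta", "hairs": "seta"}


def extract_terms(text):
    """Return preferred labels of vocabulary terms in `text`, in order of first appearance."""
    found = []
    for word in re.findall(r"[a-z]+", text.lower()):
        label = SYNONYMS.get(word, word)
        if label in VOCABULARY and label not in found:
            found.append(label)
    return found


def build_glossary(text):
    """Return an alphabetical glossary, one 'label (ID): definition' line per matched term."""
    lines = []
    for label in sorted(extract_terms(text)):
        identifier, definition = VOCABULARY[label]
        lines.append(f"{label} ({identifier}): {definition}")
    return "\n".join(lines)


description = "Head with setae plumose; antennae longer than head."
print(build_glossary(description))
```

Mapping "hairs" and "setae" to one preferred label is the same normalization that lets differently worded descriptions of a structure be found by a single query.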

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Application #: 1321620
Program Officer: Anne Haake
Project Start:
Project End:
Budget Start: 2012-06-30
Budget End: 2014-03-31
Support Year:
Fiscal Year: 2013
Total Cost: $246,183
Indirect Cost:
Name: Pennsylvania State University
Department:
Type:
DUNS #:
City: University Park
State: PA
Country: United States
Zip Code: 16802