To capitalize on the transformative opportunities offered by the increasingly large amounts of digital data produced by the biological research community, we need to systematically adopt data and metadata standards such as the Gene Ontology (GO). Because of GO's fundamental role in codifying, managing, and sharing biological knowledge, quality issues that are left unaddressed can cause misleading results or missed biological discoveries. Enhancing the quality of ontological systems such as GO, though a challenging and arduous task, can directly impact the very foundation of data-intensive research discovery. Most existing quality assurance approaches for GO have focused on the enrichment of concepts in order to keep pace with rapidly evolving biological knowledge. However, the critical structural information represented by relations has been largely ignored in existing quality assurance approaches, making them inadequate for their intended roles. Principled, scalable, and automated approaches that can debug GO to generate programmable (rather than manual) suggestions, if successful, can be a game changer in developing a new generation of methods for enhancing the quality of GO. The PI proposes a Subsumption-based Sub-term Inference Framework, SSIF, for auditing GO by leveraging both its underlying graph structure and a novel term-algebra. SSIF combines the biological knowledge embedded in the terms, sub-terms, and relationships captured in GO to automatically detect semantic inconsistencies and generate change suggestions for future versions of GO.

In order to enhance the quality of the Gene Ontology and other biomedical ontologies, the PI proposes the development of a Subsumption-based Sub-term Inference Framework, SSIF. SSIF includes three main components: (1) a sequence-based representation of GO concept terms, obtained by part-of-speech parsing and sub-concept matching; (2) the formulation of algebraic operations for a term-algebra that combines this sequence-based representation with antonyms and subsumption-based longest subsequence alignment; and (3) the construction of a set of conditional rules for backward subsumption inference, aimed at uncovering semantic inconsistencies in GO and other ontological structures. SSIF will be implemented using scalable computational algorithms and applied to the GO distributions provided by the Gene Ontology Consortium. Two algorithmic strategies will be explored to perform large-scale backward subsumption inference on GO using the conditional rules: (1) an exhaustive search over all concept pairs, and (2) a search restricted to the subspace of concept pairs within a special type of induced substructure called non-lattice subgraphs. If an existing relation in GO is inconsistent with the consequence of the conditional rules, it is a likely candidate for an error. The semantic inconsistencies uncovered by a collection of conditional rules have the potential to automatically reveal local "bugs" as well as potential systemic patterns for review and revision, to enhance the quality of GO and other biomedical ontologies.
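To make the exhaustive, all-concept-pairs strategy concrete, the following is a minimal sketch and not the SSIF implementation. It assumes a single, much simplified lexical rule: if the word sequence of one term is a proper subsequence of another term's word sequence, the longer term is expected to be subsumed by the shorter one. The toy `IS_A` graph, the function names, and the rule itself are illustrative assumptions; the abstract does not specify the actual conditional rules, the part-of-speech parsing, or the handling of antonyms.

```python
from itertools import permutations

# Toy is-a edges (child -> set of direct parents).
# Hypothetical data for illustration only, not taken from a GO release.
IS_A = {
    "positive regulation of cell growth": {"regulation of cell growth"},
    "regulation of cell growth": {"regulation of growth"},
    "negative regulation of cell growth": set(),  # parent deliberately omitted
    "regulation of growth": set(),
}


def tokens(term):
    """Represent a term as a sequence of lowercase words (no POS tagging here)."""
    return term.lower().split()


def is_subsequence(short, long):
    """Return True if `short` appears in `long` in order, gaps allowed."""
    remaining = iter(long)
    return all(word in remaining for word in short)


def ancestors(term, is_a):
    """Collect every concept reachable from `term` by following is-a edges."""
    seen, stack = set(), list(is_a.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, ()))
    return seen


def suggest_missing_is_a(is_a):
    """Check all ordered concept pairs against the assumed lexical rule.

    A pair (child, parent) is reported when the parent's words form a proper
    subsequence of the child's words but the graph has no is-a path between them.
    """
    suggestions = []
    for child, parent in permutations(is_a, 2):
        c, p = tokens(child), tokens(parent)
        if len(p) < len(c) and is_subsequence(p, c):
            if parent not in ancestors(child, is_a):
                suggestions.append((child, parent))
    return sorted(suggestions)


if __name__ == "__main__":
    for child, parent in suggest_missing_is_a(IS_A):
        print(f"candidate: '{child}' is-a '{parent}'")
```

On the toy graph, the sketch reports the two candidate relations for "negative regulation of cell growth", whose parent was left out on purpose. The all-pairs loop is quadratic in the number of concepts, which is presumably one motivation for the second strategy of restricting the search to concept pairs inside non-lattice subgraphs.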

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1657306
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2017-03-01
Budget End: 2019-02-28
Support Year:
Fiscal Year: 2016
Total Cost: $150,981
Indirect Cost:
Name: University of Kentucky
Department:
Type:
DUNS #:
City: Lexington
State: KY
Country: United States
Zip Code: 40526