The life science research community generates an abundance of data on genes, proteins, sequences, etc. These are captured in publicly available resources such as Entrez Gene, PDB and PubMed and in focused collections such as TAIR and OMIM. A number of ontologies such as GO, PO and UMLS are in use to increase interoperability. Records in these resources are typically annotated with controlled vocabulary (CV) terms from one or more ontologies. Records are often hyperlinked to those in other repositories, creating a richly curated biological Web of semantic knowledge.
The objective of this project is to develop tools to explore and mine this rich Web of annotated and hyperlinked entries so as to discover meaningful patterns. The approach builds upon finding potentially meaningful and novel associations between pairs of CV terms across multiple ontologies. The bridge of associations across ontologies reflects annotation practices across repositories. A variety of graph data mining and network analysis techniques are being explored to find complex patterns of groups of CV terms across multiple ontologies. The intent is to identify biologically meaningful associations that yield nuggets of actionable knowledge to be made available to the scientist together with a set of golden publications that support the identified patterns.
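As a rough illustration of what an association between a pair of CV terms from two ontologies could look like, the following minimal sketch counts how often a GO term and a PO term annotate the same record. The record names, the term identifiers, and the function `cross_ontology_pair_counts` are hypothetical placeholders, and raw co-annotation counting is only one simple proxy for an association, not necessarily the measure used in this project.

```python
from collections import Counter
from itertools import product

# Hypothetical annotation records: record id -> (GO terms, PO terms).
# The identifiers are made-up placeholders, not real ontology terms.
annotations = {
    "gene1": ({"GO:term_a", "GO:term_b"}, {"PO:term_x"}),
    "gene2": ({"GO:term_a"}, {"PO:term_x", "PO:term_y"}),
    "gene3": ({"GO:term_b"}, {"PO:term_y"}),
}


def cross_ontology_pair_counts(records):
    """Count, for every (GO term, PO term) pair, how many records carry both."""
    counts = Counter()
    for go_terms, po_terms in records.values():
        # Each record contributes once to every GO x PO pair it is annotated with.
        for pair in product(sorted(go_terms), sorted(po_terms)):
            counts[pair] += 1
    return counts


pair_counts = cross_ontology_pair_counts(annotations)
for (go_term, po_term), n in pair_counts.most_common():
    print(go_term, po_term, n)
```

Pairs with high counts are candidates for closer inspection; a real analysis would also need to correct for how frequently each term is used on its own.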
The intellectual merit of the project lies in how it differs from other bioinformatics data integration and analysis projects: data is integrated from across numerous sources, including genes, gene annotations, ontologies, and the literature. The research is exploratory (EAGER) with respect to both the biological and the computer science disciplines. From the biological viewpoint, a high level of speculation is associated with any discovered biological patterns. Discovered patterns might not necessarily meet criteria for experimental validation. The research methodology combines algorithmic and analytical techniques from multiple computer science sub-disciplines. While specific technical innovations are expected, an inter-related set of computer science challenges needs to be defined. This research has the potential for broader impact since the methodology can be applied to any type of interlinked resources on the biological semantic Web as well as to any collection of hyperlinked resources. This research is a collaboration between the University of Maryland and the University of Iowa. For further information see the project web pages at the following URL: www.umiacs.umd.edu/research/CLIP/RSEAGER2009/
In this research, we focus on finding complex annotation patterns representing novel and interesting hypotheses from gene annotation data. Consider the model organism collection, The Arabidopsis Information Resource (TAIR). Data entries in TAIR are typically annotated with concepts or controlled vocabulary (CV) terms from the Gene Ontology (GO) and the Plant Ontology (PO), creating a rich Web of annotation knowledge. We define a generalization of the densest subgraph problem by adding a distance restriction (defined by a separate metric) to the nodes of the subgraph. We show that while this generalization makes the problem NP-hard for arbitrary metrics, when the metric comes from the distance metric of a tree, or an interval graph, the problem can be solved optimally in polynomial time. We also show that the densest subgraph problem with a specified subset of vertices that have to be included in the solution can be solved optimally in polynomial time. In addition, we consider other extensions in which not just one solution needs to be found, but all subgraphs of almost maximum density are to be listed as well. We perform experiments to determine the properties of the dense subgraphs as we vary parameters, including the number of genes and the distance. We applied the dense subgraph methodology to many sample datasets, including a dataset of 10 photomorphogenesis genes. A user evaluation by our colleague Zhang confirms that the patterns found in the distance-restricted densest subgraph for a dataset of photomorphogenesis genes are indeed validated in the literature. A control dataset of these 10 genes as well as 10 additional genes not involved in photomorphogenesis validates that these are not random patterns. Interestingly, the complex annotation patterns potentially lead to new and as yet unknown hypotheses. In parallel, we constructed a "ground truth" database of sentences from the literature that represent the imprint for gene GO and PO annotations.
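To make the distance-restricted densest subgraph problem concrete, the following minimal sketch solves a toy instance by exhaustive search. It assumes one plausible formalization: density is the number of edges inside the chosen node set divided by the number of nodes, and every pair of chosen nodes must lie within `max_dist` of each other under the separate metric. That formalization, the function names, and the toy data are illustrative assumptions. The search is exponential in the number of nodes and is not the polynomial-time algorithm described above; it only shows what an optimal solution looks like on a small example.

```python
from itertools import combinations


def density(nodes, edges):
    """Number of edges with both endpoints in `nodes`, divided by len(nodes)."""
    node_set = set(nodes)
    inside = sum(1 for u, v in edges if u in node_set and v in node_set)
    return inside / len(node_set)


def distance_restricted_densest(nodes, edges, dist, max_dist):
    """Brute-force search over all non-empty subsets (toy sizes only).

    Assumed restriction: every pair of chosen nodes is within `max_dist`
    of each other under the metric `dist`.
    """
    best, best_density = None, -1.0
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            if any(dist(u, v) > max_dist for u, v in combinations(subset, 2)):
                continue  # violates the distance restriction
            d = density(subset, edges)
            if d > best_density:
                best, best_density = set(subset), d
    return best, best_density


# Toy instance: a complete graph on four nodes, with the separate metric
# given by positions on a line (a path, i.e. a very simple tree metric).
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("a", "d"), ("b", "d"), ("c", "d")]
position = {"a": 0, "b": 1, "c": 2, "d": 5}


def line_dist(u, v):
    return abs(position[u] - position[v])


tight = distance_restricted_densest(nodes, edges, line_dist, max_dist=2)
loose = distance_restricted_densest(nodes, edges, line_dist, max_dist=5)
print(tight)  # node d is too far from the others, so it is excluded
print(loose)  # with a loose restriction the whole graph is the densest
```

In this toy instance the tight restriction forces the solution down to the triangle on a, b and c (density 1.0), while the loose restriction admits all four nodes (density 1.5), which illustrates how the distance parameter shapes the reported pattern.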