The wealth of biological and biomedical data constantly being generated promises dramatic advancement in the life sciences. To realize this promise, this rapidly expanding pool of information needs to be efficiently integrated, that is, combined in such a way that it can be queried to extract relevant data that can subsequently be analyzed to answer meaningful research questions. The main objective of this proposal is to develop the GeneTegra System, an information integration solution that provides a common interaction environment to query data and knowledge from multiple sources. Two main obstacles have to be overcome in order to attain an effective integration of knowledge from different data sources: syntactic heterogeneity, where data sources have different representation and access mechanisms; and semantic variability, where similar lexical terms may refer to multiple concepts or dissimilar terms may refer to the same concept. The GeneTegra System addresses these obstacles through the use of Semantic Web technologies: ontologies constructed using the Web Ontology Language (OWL) as a common data and knowledge representation for data sources of diverse formats, automated mechanisms for the generation and maintenance of these ontology representations, and a robust system architecture based on reusable, service-oriented mediators. The core of the proposed system consists of general algorithms, procedures, and mechanisms developed during Phase I of this project that enable the automatic generation of ontologies, the automated identification of semantic correspondences between ontology models, and the creation and execution of queries over these ontology-modeled, distributed, heterogeneous sources.
In Phase II, the GeneTegra System will be developed, implemented, and tested as a human-centered solution building on the core components developed during Phase I. It will incorporate a highly usable interface for query creation and execution; a mechanism for registration, sharing, and reuse of information using Web Services standards; a mechanism for determining quality of data and query reliability; and a security and privacy subsystem that allows the construction of collaborative communities while ensuring that users are properly authenticated and authorized to access information through the system. The GeneTegra System will be designed and evaluated to specifically address the integration of sources relevant to investigations of genotype-phenotype associations and to the identification of genes responsible for human diseases and conditions.
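
The two obstacles named above can be illustrated with a minimal Python sketch. It is not the GeneTegra implementation and does not use OWL; it assumes two hypothetical sources (one CSV, one JSON) whose field names differ but denote the same concepts, and a hand-written correspondence table standing in for the automatically identified semantic correspondences. All record values, field names, and function names are placeholders introduced for illustration.

```python
import csv
import io
import json

# Hypothetical source A: CSV text with its own column names (placeholder data).
SOURCE_A_CSV = "gene_symbol,disease\nGENE_A,phenotype_1\nGENE_B,phenotype_2\n"

# Hypothetical source B: JSON text using different terms for the same concepts.
SOURCE_B_JSON = (
    '[{"locus": "GENE_A", "trait": "phenotype_3"},'
    ' {"locus": "GENE_C", "trait": "phenotype_4"}]'
)

# Assumed correspondences from source-specific terms to shared concepts.
# Dissimilar terms ("disease", "trait") map to one concept ("Phenotype").
TERM_TO_CONCEPT = {
    "gene_symbol": "Gene",
    "locus": "Gene",
    "disease": "Phenotype",
    "trait": "Phenotype",
}


def load_csv(text):
    """Syntactic step for source A: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json(text):
    """Syntactic step for source B: parse JSON text into a list of dicts."""
    return json.loads(text)


def to_common_model(records):
    """Semantic step: rename each source-specific field to its shared concept."""
    return [
        {TERM_TO_CONCEPT[field]: value for field, value in record.items()}
        for record in records
    ]


def phenotypes_for(gene):
    """Query both sources through the common model and merge the results."""
    records = to_common_model(load_csv(SOURCE_A_CSV))
    records += to_common_model(load_json(SOURCE_B_JSON))
    return sorted(r["Phenotype"] for r in records if r["Gene"] == gene)


if __name__ == "__main__":
    print(phenotypes_for("GENE_A"))
```

In this sketch the loaders absorb the syntactic differences between sources, while the correspondence table absorbs the semantic ones, so a single query function can be written once against the shared concepts.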

Public Health Relevance

The GeneTegra System is an information integration solution that provides a common interaction environment to query data and knowledge from multiple heterogeneous sources. It uses ontologies as the base formalism for semantic and syntactic modeling, and contains automated mechanisms for the generation of these ontologies and for the reuse and sharing of integration configurations. It is specifically designed to address the integrated querying of sources relevant to investigations of genotype-phenotype associations and to the identification of genes responsible for human diseases and conditions.

Agency: National Institutes of Health (NIH)
Institute: National Center for Research Resources (NCRR)
Type: Small Business Innovation Research Grants (SBIR) - Phase II (R44)
Project #: 5R44RR018667-04
Application #: 7614360
Study Section: Biomedical Computing and Health Informatics Study Section (BCHI)
Program Officer: Brazhnik, Olga
Project Start: 2003-07-01
Project End: 2011-03-31
Budget Start: 2009-04-01
Budget End: 2010-03-31
Support Year: 4
Fiscal Year: 2009
Total Cost: $505,764
Indirect Cost:
Name: Infotech Soft, Inc.
Department:
Type:
DUNS #: 035354070
City: Miami
State: FL
Country: United States
Zip Code: 33131
Jean-Mary, Yves R; Shironoshita, E Patrick; Kabuka, Mansur R (2009) Ontology Matching with Semantic Verification. Web Semant 7:235-251
Shironoshita, E Patrick; Jean-Mary, Yves R; Bradley, Ray M et al. (2009) semQA: SPARQL with Idempotent Disjunction. IEEE Trans Knowl Data Eng 21:401-414