A number of scientific endeavors generate data that can be modeled as graphs: high-throughput biological experiments on protein interactions, high-throughput screening of chemical compounds, social networks, ecological networks and food webs, database schemas, and ontologies. Access to and analysis of the resulting annotated and probabilistic graphs are crucial for advancing scientific research, for accurate modeling and analysis of existing systems, and for engineering new systems. This project aims to develop a set of scalable querying and mining tools for graph databases by integrating techniques from databases and data mining. The proposed research is both theoretical and empirical: new theoretical ideas and algorithms are being developed and applied to the domains of Cheminformatics and Bioinformatics.

The first research thrust examines primitives for graph data management and graph mining. A declarative query language for graphs is being investigated. This language is based on a formal language for graphs and a graph algebra, and separates the concerns of specification and implementation. Scalability of techniques for similarity search on graphs and mining for significant patterns is being investigated as a part of this thrust.

The second research thrust applies the developed techniques to the domain of Cheminformatics. The specific tasks being examined are search for similar compounds, mining for significant motifs, diversity analysis, and analysis of macromolecular complexes.

The final research thrust applies the developed methods to the domain of Bioinformatics. There has been an explosion of widely diverse biological data, arising from genome-wide characterization of transcriptional profiles, protein-protein interactions, genomic structure, genetic phenotype, gene interactions, gene expression, proteomics, and other techniques. The techniques being developed can efficiently integrate and analyze data from multiple sources and models, accelerating interaction prediction, function prediction, and pathway discovery.

Further information about the project can be found at the project web page www.cs.ucsb.edu/~dbl/0917149.php.

Project Report

A number of scientific endeavors generate data that can be modeled as graphs: high-throughput biological experiments on protein interactions, high-throughput screening of chemical compounds, social networks, ecological networks and food webs, database schemas, and ontologies. Access to and analysis of the resulting annotated and probabilistic graphs are crucial for advancing scientific research, for accurate modeling and analysis of existing systems, and for engineering new systems. This project developed a set of scalable querying and mining tools for graph databases by integrating techniques from databases and data mining. The research was both theoretical and empirical: new theoretical ideas and algorithms were developed and applied to the domains of Cheminformatics and Bioinformatics. We worked on the following two specific problems.

1. Analysis of global-state networks: Global-state networks provide a powerful mechanism to model the increasing heterogeneity in data generated by current systems. Such a network comprises a series of network snapshots with dynamic local states at nodes, and a global network state indicating the occurrence of an event. These networks arise in biology (pathways implicated in a disease), learning (brain regions activated in a learning task), and social networks (sentiments of users).

2. Top-k representative queries on graph databases: We investigated the problem of top-k representative queries on graph databases. Such queries are useful when a user wants a quick summary of a large collection of graphs based on the user's own definition of relevance.
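
The global-state network described under the first problem above can be pictured as a small data structure. The following is only an illustrative sketch, not the project's implementation: the class names `Snapshot` and `GlobalStateNetwork`, the helper `mean_local_state`, the choice of numeric local states, and the assumption that the edge set stays fixed across snapshots are all hypothetical and do not come from the report.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Snapshot:
    """One observation of the network (hypothetical layout)."""
    local_state: Dict[str, float]  # node -> local state in this snapshot
    global_state: bool             # True if the event occurred


@dataclass
class GlobalStateNetwork:
    """A series of snapshots over one graph.

    Assumption: the edge set is shared by all snapshots and only the
    node states change from one snapshot to the next.
    """
    edges: Set[Tuple[str, str]]
    snapshots: List[Snapshot] = field(default_factory=list)

    def mean_local_state(self, node: str, event: bool) -> float:
        """Average local state of `node` over snapshots whose global
        state equals `event`."""
        values = [
            s.local_state[node]
            for s in self.snapshots
            if s.global_state == event and node in s.local_state
        ]
        if not values:
            raise ValueError(f"no snapshot with global state {event} contains {node!r}")
        return sum(values) / len(values)


# Toy example: three nodes, three snapshots, two of them with the event.
net = GlobalStateNetwork(edges={("A", "B"), ("B", "C")})
net.snapshots.append(Snapshot({"A": 1.0, "B": 0.0, "C": 2.0}, True))
net.snapshots.append(Snapshot({"A": 3.0, "B": 0.0, "C": 2.0}, True))
net.snapshots.append(Snapshot({"A": 0.0, "B": 4.0, "C": 2.0}, False))

print(net.mean_local_state("A", True))   # 2.0
print(net.mean_local_state("A", False))  # 0.0
```

In this toy layout, comparing a node's average local state between snapshots with and without the event gives a rough sense of which nodes vary with the global state. In the biological example from the report, the nodes would correspond to members of a pathway and the event to the presence of a disease.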

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0917149
Program Officer: Sylvia J. Spengler
Project Start:
Project End:
Budget Start: 2009-09-15
Budget End: 2013-08-31
Support Year:
Fiscal Year: 2009
Total Cost: $509,261
Indirect Cost:
Name: University of California Santa Barbara
Department:
Type:
DUNS #:
City: Santa Barbara
State: CA
Country: United States
Zip Code: 93106