Scientific data repositories increasingly involve large amounts of images and streams of empirical measurements generated by a diverse set of data sources. The goal of this project is to develop online structures and algorithms to dynamically maintain and analyze data sequences for scientific discovery and monitoring purposes. The implementation focuses on specific applications from the physical and biological sciences that generate vast amounts of multi-dimensional data sequences. For scientific discovery, an iterative querying framework is developed for modeling sequences of observations. The framework optimally utilizes access structures to execute queries ranging from a simple max aggregate to complex scientific queries. Interactive tools are implemented that let researchers incorporate domain-specific knowledge into the search process. For real-time monitoring, one-pass summaries that can be updated in constant time are developed. The structures are designed to be self-adaptive with respect to workload changes and to handle heterogeneous and incomplete information. The project involves collaborations with domain experts in the focus areas and is expected to advance the state of the art in the application domains. For example, the gene expression analysis tools implemented in this project have already enhanced the ability of the collaborating researchers to study Haemophilus influenzae (first described in 1892 by Dr. Richard Pfeiffer during an influenza pandemic) in order to understand its role in a wide range of clinical diseases, so that effective vaccines can be developed. This research project is integrated with education through significant educational and outreach activities. The developed toolkits, findings, and methods of the project will be communicated in a broader context and to an expanded audience through the project website (www.cse.ohio-state.edu/~hakan/Career.html).

Project Report

The project resulted in novel technologies for exploring the large amounts of data produced by a diverse set of data applications. The developed methods include management of multi-dimensional data streams, novel querying and analytics support, online structures, and analysis of biological data. The new structures and algorithms are shown to dynamically maintain and analyze large-scale data and multiple streams that can be incrementally updated. The research is integrated with education through significant educational and outreach activities. The developed software and outcomes are communicated to both the scientific community and end users.
We developed a framework that utilizes indexing methods to execute a wide variety of queries. Model-based optimization queries are expressed through a generic model that can define a wide variety of queries involving an optimization objective function and a set of constraints on the attributes. For the limited optimization query types that already have optimal solutions, we achieved nearly identical performance, while providing generic modeling and processing for a much broader class of queries and effectively handling problem constraints. The proposed framework offers I/O-optimal access to whatever access structure is used for the query.
We introduced a parameterizable technique to recommend indexes based on index types frequently used for high-dimensional data sets and to dynamically adjust indexes as the underlying query workload changes. We incorporated a query pattern change detection mechanism to determine when the access patterns have changed enough to warrant a change in the physical database design.
We introduced a method to summarize a sliding window of the most recent entries in a one-pass fashion. The correlation between consecutive data elements is effectively taken into account without the need for any pre-processing. Queries on any subsequence of a sliding window over multiple streams are processed efficiently.
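As a rough illustration of the sliding-window idea, the sketch below keeps a one-pass summary of the most recent values of a single stream and answers sum and mean queries on any subsequence of the window in constant time. It is not the project's actual structure: the class name `SlidingWindowSummary`, its methods, and the choice of running totals in a circular buffer are assumptions made for this example, and the sketch does not model the correlation between consecutive elements that the project's method exploits. For multiple streams, one such object per stream would be kept.

```python
class SlidingWindowSummary:
    """Hypothetical one-pass summary of the most recent `size` values."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size
        # Circular buffer of running totals. The total after k values
        # have arrived is stored at slot k % (size + 1).
        self._prefix = [0.0] * (size + 1)
        self._count = 0  # number of values seen so far

    def update(self, value):
        """Absorb one new value in constant time."""
        slots = self.size + 1
        previous = self._prefix[self._count % slots]
        self._count += 1
        # Overwrites the oldest running total once the window is full.
        self._prefix[self._count % slots] = previous + value

    def __len__(self):
        """Number of values currently inside the window."""
        return min(self._count, self.size)

    def range_sum(self, start, stop):
        """Sum of window positions start..stop-1 (0 is the oldest)."""
        n = len(self)
        if not 0 <= start <= stop <= n:
            raise IndexError("subsequence lies outside the window")
        slots = self.size + 1
        base = self._count - n  # stream position of the oldest value
        high = self._prefix[(base + stop) % slots]
        low = self._prefix[(base + start) % slots]
        return high - low

    def range_mean(self, start, stop):
        """Mean of a non-empty subsequence of the window."""
        if start == stop:
            raise ValueError("empty subsequence")
        return self.range_sum(start, stop) / (stop - start)


# Usage: a window of 3 over the stream 1, 2, 3, 4, 5 holds [3, 4, 5].
summary = SlidingWindowSummary(3)
for reading in [1, 2, 3, 4, 5]:
    summary.update(reading)
whole_window = summary.range_sum(0, 3)
newest_two = summary.range_mean(1, 3)
```

Storing running totals rather than raw values is what makes subsequence queries a single subtraction. A known limitation of this simple design is that floating-point running totals can drift on very long streams, so a production structure would need periodic renormalization or a different summary.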
The project also included a microarray data analytics case study for grouping genes using multiple distance measures. Each measure captured a particular similarity view, such as shifted relationships, negative correlations, and strong positive relationships. The effectiveness of the algorithm is demonstrated on multiple microarray data sets. The framework allowed merging of results from different datasets obtained from different clustering algorithms using different metrics. Different combinations of set operations revealed different kinds of interactions between genes. The technique identified co-regulated genes, operons, and regulons based on microarray time-series data.
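The idea of viewing the same genes through several similarity measures and then combining the results with set operations can be sketched as follows. This is only an illustration under stated assumptions: the gene names and expression profiles are invented toy data, the three measures (Pearson correlation, its negation, and correlation after a one-step time shift) and the 0.9 threshold are choices made for this example, and the sets hold gene pairs rather than the clustering results the project actually merged.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def negative(x, y):
    """High when the two profiles move in opposite directions."""
    return -pearson(x, y)


def shifted(x, y, lag=1):
    """Best correlation when one profile trails the other by `lag` steps."""
    return max(pearson(x[:-lag], y[lag:]), pearson(y[:-lag], x[lag:]))


def related_pairs(profiles, score, threshold=0.9):
    """Set of gene pairs whose score reaches the threshold."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(profiles), 2)
        if score(profiles[a], profiles[b]) >= threshold
    }


# Invented time-series profiles (hypothetical genes, not project data).
profiles = {
    "geneA": [0, 0, 1, 3, 1, 0, 0, 0],
    "geneB": [1, 1, 3, 7, 3, 1, 1, 1],      # scaled copy of geneA
    "geneC": [0, 0, -1, -3, -1, 0, 0, 0],   # mirror image of geneA
    "geneD": [0, 0, 0, 1, 3, 1, 0, 0],      # geneA delayed by one step
}

positive_pairs = related_pairs(profiles, pearson)
negative_pairs = related_pairs(profiles, negative)
shifted_pairs = related_pairs(profiles, shifted)

# Set operations expose different kinds of relationships.
delayed_only = shifted_pairs - positive_pairs      # visible only after a shift
co_expressed = positive_pairs | shifted_pairs      # same pattern, any timing
candidate_group = set().union(*co_expressed)       # genes sharing that pattern
```

On this toy data, the difference `shifted_pairs - positive_pairs` picks out the pairs involving the delayed gene, which a plain correlation measure alone would miss; this is the kind of complementary view that motivates using more than one measure.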

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0546713
Program Officer: Maria Zemankova
Project Start:
Project End:
Budget Start: 2006-08-01
Budget End: 2011-07-31
Support Year:
Fiscal Year: 2005
Total Cost: $455,000
Indirect Cost:
Name: Ohio State University
Department:
Type:
DUNS #:
City: Columbus
State: OH
Country: United States
Zip Code: 43210