Data mining techniques make it possible to train accurate prediction models for large, high-dimensional data. Unfortunately, complex prediction models are not easy to understand. To make them 'digestible', analysts need simpler patterns that summarize the complex functions extracted by the model. The number of such function summaries is overwhelming, because each slice of a lower-dimensional subspace of the original data space could contain an interesting function summary.

The goal of this project is to develop techniques for finding the most 'interesting' function summaries automatically and efficiently. This is done in three steps. The first is to formalize the notion of interestingness for a wide variety of pattern types. The second is to develop a declarative language for specifying these interestingness measures. With a declarative language, analysts define what they find interesting, but they need not specify how to find it efficiently. The third is to build an optimizing compiler for a small language fragment, which takes care of performance. A major research challenge is to strike the right balance between the expressiveness of the language and its amenability to effective query optimization.

The results of this project will pave the way for powerful exploratory analysis tools. They will also enable future research on optimizers and user-friendly interfaces for the declarative language. The approach will be validated using the rich data resources being organized by the ornithological community in the Avian Knowledge Network (AKN). Analyzing these data will have a tremendous impact on the ability to identify the most significant environmental variables that affect biodiversity on the planet. For example, land managers could discover the possible impact of their decisions on an ecosystem's health.

A component of the language will be available to the public through Web services on the AKN Web site (www.avianknowledge.net/content). Additional results will be disseminated through the project Web site (www.cs.cornell.edu/~mirek/Projects/FunctionSummaries). This will enable a broad audience, from researchers and land managers to bird watchers, teachers, and school children, to derive novel knowledge from the gathered data resources.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0748626
Program Officer: Maria Zemankova
Budget Start: 2007-09-15
Budget End: 2009-09-30
Fiscal Year: 2007
Total Cost: $200,000

Name: Cornell University
City: Ithaca
State: NY
Country: United States
Zip Code: 14850