Association analysis finds patterns that describe the relationships among the binary attributes (variables) used to characterize a set of objects. A key strength of association pattern mining is that the potentially exponential search space can often be made tractable by support-based pruning, i.e., by eliminating patterns supported by too few transactions. Despite the well-developed theoretical foundation of association mining, this group of techniques is not widely used as a data analysis tool in many scientific domains. For example, in bioinformatics and computational biology, clustering and classification techniques are common, but techniques from association analysis are rarely employed. This is because many of the patterns required in bioinformatics and other domains are not effectively captured by the traditional association analysis framework and its current extensions. Although such patterns can be found by techniques such as bi-clustering and co-clustering, these approaches suffer from a number of serious limitations, most notably an inability to explore the search space efficiently without resorting to heuristics that compromise the completeness of the search. To address these challenges, the team will extend the traditional association analysis framework. They propose two novel frameworks for directly mining patterns from real-valued data that, unlike bi-clustering and co-clustering, are able to discover all patterns satisfying the given constraints and do not suffer from the loss of information caused by discretization and other data transformation approaches. They will also extend association-analysis-based approaches to work with data that has class labels, using the available class label information both to prune the exponential search space and to find low-support patterns that discriminate between the two data classes.
To assess the results of the work, they will develop robust methodologies for evaluating the patterns obtained from the proposed frameworks. The proposed work promises to extend the power of association analysis to a wide range of new applications in health and life sciences, such as the discovery of biomarkers and functional modules from single nucleotide polymorphism and gene expression data, with potential applications in personalized medicine and the development of drugs and bio-fuels.
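
As an illustration of the support-based pruning mentioned in the abstract, the following is a minimal sketch of a level-wise (Apriori-style) search over binary transaction data. It is not the project's own method or code: the function name `frequent_itemsets`, the `min_support` parameter, and the toy `transactions` are assumptions made for this example only.

```python
from itertools import combinations

# Hypothetical toy data: each transaction is the set of items it contains.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets in >= min_support transactions.

    Illustrative sketch only: candidates of size k are built solely from
    itemsets of size k-1 that survived pruning, which is what keeps the
    exponential search space manageable.
    """
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    while current:
        # Support-based pruning: drop candidates found in too few transactions.
        survivors = {}
        for candidate in current:
            count = support(candidate)
            if count >= min_support:
                survivors[candidate] = count
        result.update(survivors)

        # Build candidates one item larger; every (k-1)-subset must have survived.
        k = len(next(iter(survivors))) + 1 if survivors else 0
        candidates = set()
        for a, b in combinations(survivors, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in survivors for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return result


frequent = frequent_itemsets(transactions, min_support=3)
```

With a threshold of three transactions, only the four single items and the pair bread and milk survive, so no three-item candidate is ever generated or counted.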

Project Report

The area of data mining known as association analysis seeks to find patterns that describe the relationships among the attributes (variables) used to describe a set of objects. The most common example is market basket data, where the objects are sales transactions consisting of sets of store items purchased by a customer, and the attributes are binary variables that indicate whether or not an item was purchased by a particular customer. The patterns are either sets of items that are frequently purchased together or rules that capture the fact that the purchase of one set of items often implies the purchase of a second set of items. More generally, association analysis can be used for many important practical applications. For example, it can be used to find biomarkers, which are patterns in biomedical data that identify a group of people with common characteristics, such as a set of genetic variations that make them more susceptible to a particular disease. Despite the potential intellectual benefits of association pattern discovery and its various applications, this group of techniques is not widely used as a data analysis tool in most scientific and commercial domains. The reason is that most association analysis techniques target data that contains only binary (yes/no, present/absent) variables and are used mostly for descriptive rather than predictive data analysis. We have addressed this issue by advancing the state of the art in association analysis, with important applications to biological data. A significant part of our work has dealt with finding biomarkers in biomedical data. As an example, one such technique can find patterns in gene expression data that differentiate healthy patients from cancer patients. In general, the biomarkers discovered by our techniques can be used for detection and prediction and to further our basic understanding of disease. Complete details can be found in the associated papers and code produced by our project.
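
To make the market basket example concrete, the sketch below computes two standard measures for a rule of the form "purchase of one set of items implies purchase of a second set": support (the fraction of transactions containing both sets) and confidence (the fraction of transactions containing the first set that also contain the second). The function name `rule_stats` and the toy `baskets` are assumptions for illustration and are not taken from the project's code.

```python
# Hypothetical toy market basket data: one set of purchased items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = fraction of all transactions containing both item sets
    confidence = fraction of antecedent-containing transactions that also
                 contain the consequent
    """
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions) if transactions else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


support, confidence = rule_stats(baskets, {"diapers"}, {"beer"})
```

In this toy data, diapers appear in three of the five baskets and two of those also contain beer, so the rule from diapers to beer has support 2/5 and confidence 2/3.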

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Type
Standard Grant (Standard)
Application #
0916439
Program Officer
Sylvia J. Spengler
Project Start
Project End
Budget Start
2009-08-01
Budget End
2012-11-30
Support Year
Fiscal Year
2009
Total Cost
$547,996
Indirect Cost
Name
University of Minnesota Twin Cities
Department
Type
DUNS #
City
Minneapolis
State
MN
Country
United States
Zip Code
55455