Many of the most interesting and valuable discoveries that can be made from data arise not from the evaluation of single records, but from identifying a set of records that are anomalous in some interesting way. Together, they may indicate, for example, the emergence of a disease outbreak or new patterns of criminal activity. One can view pattern discovery as an interactive process between data analysis algorithms and human users who have expertise in the domain. This research will develop an integrated framework of probabilistic methods that interact with the user in detecting, characterizing, explaining, and learning anomalous patterns over groups of records. The focus is on the many situations where the data (and the probabilistic patterns to be discovered) are not suited to existing techniques such as graph mining or frequent sets. The proposed methods will search over arbitrary subsets of records and evaluate their correspondence to known, potentially very complex, probabilistic patterns, or their failure to match baseline data under various learned statistical models. These methods will also assist the user in understanding and modeling discovered, previously unknown anomalies, so that they can be identified as known patterns when encountered in the future.

Intellectual Merit This collaborative team of researchers will develop, implement, and evaluate a general, comprehensive, and widely applicable probabilistic framework for pattern discovery. The proposed work will address these challenging and important research questions:
- How can machine learning concepts such as classification and anomaly detection be generalized to consider groups of records rather than single records?
- How can a detection algorithm simultaneously detect and differentiate between known and currently unknown pattern types?
- How can an algorithm explain clearly to a user what pattern was found and why?
- How can an algorithm learn new pattern types through feedback from a user?

The ability to detect, characterize, explain, and learn patterns from groups of records in massive datasets will provide a qualitatively new approach for advancing discovery of knowledge from data.

Broader Impact Although the applications for these algorithms are innumerable, development and testing will be prioritized in the areas of patient care in the intensive care unit (ICU) and aircraft fleet maintenance. Through the team's existing collaborations, the algorithms will also be used during the project in other areas including food safety, scientific discovery in astronomy sky surveys, and detection of geographic hot-spots of criminal activity. Together, these applications will demonstrate the methods' value across a wide spectrum of domains and tasks.

Key Words: anomalous patterns; pattern discovery; probabilistic models; incremental learning.

Project Report

Many of the most interesting and valuable discoveries that can be made from data arise not from the evaluation of single records, but from identifying a set of records that are anomalous in some interesting way. Together, they may indicate, for example, the emergence of a disease outbreak or new criminal activity. In scientific domains, they may represent new phenomena waiting to be discovered. We have developed, implemented, and extensively evaluated a general, comprehensive, and widely applicable probabilistic framework for pattern discovery. Our work aimed to address challenging and important research questions, in particular:
- How can machine learning concepts such as classification and anomaly detection be generalized to consider groups of records rather than single records?
- How can a detection algorithm simultaneously detect and differentiate between known and currently unknown pattern types?
- How can an algorithm explain clearly to a user what pattern was found and why?
- How can an algorithm learn new pattern types through feedback from a user?

The ability to detect, characterize, explain, and learn patterns from groups of records in massive datasets, developed in the framework of this project, provides a qualitatively new capability for advancing discovery of knowledge from data.
Through our collaborations with partners in academia, government, and industry, we have broadly demonstrated the utility of the framework in a range of practical applications, including, for example:
- Public health: detection and characterization of emerging outbreaks of disease, and modeling the spread of contamination over water supply networks;
- Food safety: identification of patterns of linkage between foodborne disease in humans and risk of microbial contamination of food;
- Maintenance of fleets of equipment: discovery of statistically and pragmatically meaningful relationships between equipment maintenance activities and equipment reliability;
- Law enforcement practice and policy: mining publicly available Internet data for clusters of potential human trafficking activity, and modeling and predicting distributions of urban crime;
- Civil safety: advance prediction of unrest episodes such as strikes, protests, or riots, as well as prediction of war atrocities;
- Astrophysics: identification of complex unknown patterns in large-scale data for the purpose of scientific discovery;
- Medical informatics: discovery of breaks in billing patterns to improve the accuracy of medical claim processing;
- Clinical medicine: detection and characterization of emerging episodes of cardio-respiratory instability in intensive care patients.

We have also extensively tested our algorithms against a number of benchmark databases from the domains of text mining, image classification, multi-factor temporal data analysis, and others, involving active, semi-supervised, supervised, and unsupervised machine learning tasks.

Our work has been widely disseminated. Project team members gave 116 public presentations of our work, including 39 invited talks, 6 of which were plenary keynotes at scientific conferences. We have published 5 book chapters, 45 journal articles, and 49 papers in scientific conference proceedings. A handful of additional publications are under peer review.
This project has been an effective platform for development of human resources. It involved 23 doctoral students, 19 master's and undergraduate students, and 6 research staff (including 2 individuals with doctoral degrees), supervised by 7 faculty members. Our work yielded 6 completed doctoral dissertations directly related to this project, as well as 2 master's theses and 1 undergraduate honors thesis. A few additional doctoral dissertations are at an advanced stage and will be finalized in the next few months.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
0911032
Program Officer
Frank Olken
Project Start
Project End
Budget Start
2009-09-01
Budget End
2014-08-31
Support Year
Fiscal Year
2009
Total Cost
$2,598,153
Indirect Cost
Name
Carnegie-Mellon University
Department
Type
DUNS #
City
Pittsburgh
State
PA
Country
United States
Zip Code
15213