This work will develop new methods for fast and scalable detection of anomalous patterns (subsets of the data that are interesting or unexpected) in massive, multivariate datasets. There will be a focus on real-world applications such as an emerging disease outbreak or a pattern of smuggling activity with complex, subtle, and probabilistic patterns that are difficult to spot with existing techniques. The research is based on two key insights. First, the pattern detection problem can be framed as a search over all subsets of the data, in which can be defined a measure of the "anomalousness" of a subset and then maximize this measure over all potentially relevant subsets.
Second, it has been discovered that, for many spatial detection methods (including Kulldor's spatial scan statistic and many recently proposed variants), one can perform an exact search which efficiently maximizes the measure of anomalousness over all subsets of the data. The research team will explore this new combinatorial optimization method, investigate how it can be extended to constrained subset scans and to more general multivariate pattern detection problems, and examine how it can be incorporated into a subset scan framework, enabling the creation a variety of fast, scalable, and useful methods for anomalous pattern detection.
Intellectual Merit The research team will develop, implement, and evaluate a general probabilistic framework for efficient detection of anomalous patterns in both spatial and non-spatial datasets. The proposed work will address these challenging and important research questions: 1)How can one define a useful measure of the "anomalousness" of a subset of the data, and efficiently optimize this measure over all subsets to find the most anomalous patterns? 2) What are the necessary and sufficient conditions for a set function F (S ) to satisfy the "linear- time subset scanning" (LTSS) property, enabling exact unconstrained optimization of F (S ) over all 2 N subsets of N records while only requiring O(N ) subsets to be evaluated? 3) How can one extend fast subset scanning methods to general multivariate datasets, and incorporate search constraints such as proximity, connectivity, and self-similarity? 4) How can one deal with uncertainty about the effects of an anomalous pattern by searching over subsets of "input" and "output" attributes as well as subsets of records?
Broader Impact Development and testing will be prioritized in three areas: 1) early detection of disease outbreaks, 2) detecting illicit container shipments, and 3) identifying anomalous trends in social networks. These applications will allow the demonstration the value of these methods across a wide spectrum of domains. Through existing collaborations, the algorithms will be incorporated into deployed systems for health and crime surveillance that contribute directly to the public good. The Principle Investigator's lab has over 5 years of history offering free machine learning software, and the software implementations of all algorithms developed through this grant will be made publicly available. The bulk of the funding will go to training graduate students who will become the next generation of researchers to explore new methods for anomalous pattern detection.
Key Words: anomalous patterns; pattern detection; fast subset scan; scan statistics; optimization.