One of the greatest challenges in modern data analysis is identifying subtle, complex anomalous patterns (subsets of a data set that are novel or unexpected) within the multi-modal, heterogeneous, high-dimensional, multi-source data sets that are now ubiquitous. Detecting such patterns is indispensable for knowledge mining and discovery across many fields of science, engineering, and business, with applications including the early detection of infectious disease outbreaks, crime hotspots, network intrusions, false advertising, cyber botnets, customer activity monitoring and user profiling, and fraudulent medical claims, among others. The research goal of this project is to develop a new paradigm for discovering such patterns. The key idea is to generalize meta-analysis from the statistical community and to reframe the problem as a search over all subsets of nonparametric statistical tests conducted on individual record-level features, in order to find the subsets (anomalous patterns) that are jointly significant. The project focuses on real-world problems in biosurveillance and cybersecurity, with two challenging applications: early detection of rare and infectious disease outbreaks (e.g., foodborne illness, Hantavirus, yellow fever) and detection of Sybil attacks (e.g., spammers, fake users, and compromised normal users). The integrated education plan includes the development of new courses for the Master of Science program in Computer Science and Informatics and outreach to underrepresented groups. The outcomes of this project will be widely disseminated to a broader audience via tutorials and workshops.

The research objectives of this project are: (1) developing nonparametric tests for modeling anomalous information in multi-source datasets; (2) learning heterogeneous dependencies among nonparametric tests; (3) detecting anomalous patterns from an extremely large set of nonparametric tests; and (4) making the detected anomalous patterns interpretable in the context of multi-modal, heterogeneous, and high-dimensional data. The research approach includes the development of (1) nonparametric tests on individual record-level features that provide consistent representations of anomalous information from multiple heterogeneous input modalities, such as image, text, video, and multiple sensor streams; (2) deep structured and adversarial methods capable of learning robust hierarchical dependency structures among nonparametric tests from unlabeled training data; (3) fast, scalable combinatorial optimization methods capable of accurately detecting salient anomalous patterns among billions of nonparametric tests; and (4) transparent and interpretable methods capable of explaining the predicted anomalous patterns by identifying the training instances and features most responsible for the predictions.
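As a rough illustration of searching over subsets of record-level nonparametric tests for a jointly significant subset, the following toy sketch may help. It rests on assumptions not stated in this abstract: each test is summarized by an empirical upper-tail p-value against a baseline sample, joint significance is scored with the Berk-Jones statistic, and the search is exhaustive, which is feasible only for a handful of records. All function names and data values are hypothetical, and the project's actual statistics and optimization methods may differ.

```python
import math
from itertools import combinations


def empirical_p_value(value, baseline):
    """Upper-tail empirical p-value of `value` against a baseline sample."""
    at_least = sum(1 for b in baseline if b >= value)
    return (at_least + 1) / (len(baseline) + 1)


def bernoulli_kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b), with 0 < b < 1."""
    kl = 0.0
    if a > 0:
        kl += a * math.log(a / b)
    if a < 1:
        kl += (1 - a) * math.log((1 - a) / (1 - b))
    return kl


def berk_jones(p_values, alphas=(0.01, 0.05, 0.1)):
    """Berk-Jones score: large when more p-values fall at or below alpha
    than the fraction alpha expected under the null (assumed statistic)."""
    n = len(p_values)
    best = 0.0
    for alpha in alphas:
        frac = sum(p <= alpha for p in p_values) / n
        if frac > alpha:
            best = max(best, n * bernoulli_kl(frac, alpha))
    return best


def best_subset(p_by_record):
    """Exhaustive search over non-empty subsets of records.
    Exponential in the number of records, so toy sizes only."""
    ids = sorted(p_by_record)
    best_score, best_ids = 0.0, ()
    for k in range(1, len(ids) + 1):
        for subset in combinations(ids, k):
            pooled = [p for r in subset for p in p_by_record[r]]
            score = berk_jones(pooled)
            if score > best_score:
                best_score, best_ids = score, subset
    return best_ids, best_score


# Hypothetical data: one shared baseline sample and two features per record.
baseline = list(range(1, 20))
records = {"r1": (5, 7), "r2": (25, 30), "r3": (22, 40), "r4": (10, 3)}

# One nonparametric test (empirical p-value) per record-level feature.
p_by_record = {
    rid: [empirical_p_value(v, baseline) for v in values]
    for rid, values in records.items()
}

best_ids, best_score = best_subset(p_by_record)
print(best_ids, round(best_score, 3))
```

In this toy data set, records r2 and r3 exceed every baseline value on both features, so their pooled p-values are the ones selected as the jointly significant subset. The exhaustive loop is what objective (3) would need to replace with scalable combinatorial optimization.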

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Budget Start: 2018-09-01
Budget End: 2019-11-30
Fiscal Year: 2018
Total Cost: $249,989
Name: SUNY at Albany
City: Albany
State: NY
Country: United States
Zip Code: 12222