Detection of threats is critical to national and global security. With tremendous amounts of surveillance and other types of data currently available, a systematic and formal quantitative approach to threat detection is needed. Threats are often preceded by abnormal behavior; early threat detection thus becomes detection of abnormal behavior. Statistically, detection of abnormal behavior is essentially detection of outliers from a "usual" distribution or a "usual" relationship. While missing a threat may have devastating impacts on society, a false detection also has negative impacts. This project aims to develop a new outlier detection framework under which the confidence level of detection is formulated in terms of the level of false positives and is precisely determined, to aid decision makers in more informed resource planning. The development provides novel statistical approaches for threat detection utilizing a wide range of data sources. This outlier detection method and the confidence level determination tool are general statistical methods that form a useful framework for many threat detection and risk assessment problems. They will enrich the theory and methodology of statistics, produce a new statistical analytical toolkit, and contribute to general data science, since outlier identification is a crucial stage of data cleaning for valid downstream analysis. The investigators will actively engage in activities related to education and research training of graduate and undergraduate students, especially attracting minority and women students into the fields of statistics and statistical applications, and introducing them to areas that are important to global and national security.
Although outlier detection algorithms have been extensively studied, most existing methods do not provide an uncertainty assessment and rely on ad hoc rules to make judgment calls. The "Conformity Outlier Detection" (COD) framework under development in this project can overcome this shortcoming and provide detection with a theoretically guaranteed confidence level. This development is based on a state-of-the-art non-parametric predictive inference tool in machine learning and statistics, known as conformal prediction. It can provide accurate assessment of risk and uncertainty with few assumptions on the data and can be applied broadly. Under this new COD framework, the project explores two outlier detection procedures. The first is distribution-free and is suitable for any data set that provides pairwise similarity measures between subjects. It can be used for outlier detection in a broad class of unconventional data sets often encountered in counter-terrorism surveillance (e.g., text data with a word-use frequency similarity measure, communication pattern changes, voice similarities, network changes, and many others). The second is a model-based procedure to detect an abnormal deviation from a "usual" relationship or behavior. The detection method is robust against model misspecification under certain settings. In addition, since heterogeneity is commonly seen in large data sets, the project includes an extension of the COD procedures to precision contextual outlier detection under the general individualized learning framework. Lastly, the project aims to demonstrate the approaches in a specific setting with sparsely observed spatial and temporal count processes that are commonly encountered in surveillance of remote areas. This project will support one graduate student per year.
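To illustrate how conformal prediction can attach a false-positive level to an outlier call using only pairwise distances, the following is a minimal sketch of a generic split-conformal outlier test. It is not the project's COD procedure, whose details are not given here. The function names (`knn_score`, `conformal_p_value`, `flag_outlier`), the k-nearest-neighbor nonconformity score, the Euclidean distance, and the simulated Gaussian data are all assumptions made for illustration. If the calibration observations and a new "usual" observation are exchangeable, the p-value computed below falls at or under alpha with probability at most alpha, which is the sense in which the false-positive level is controlled.

```python
import math
import random


def knn_score(point, reference, k, dist):
    """Nonconformity score (an illustrative choice): mean distance from
    `point` to its k nearest items in the reference set."""
    distances = sorted(dist(point, r) for r in reference)
    return sum(distances[:k]) / k


def conformal_p_value(test_score, calibration_scores):
    """Split-conformal p-value: the share of scores (calibration plus the
    test score itself) that are at least as large as the test score."""
    n = len(calibration_scores)
    at_least_as_extreme = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least_as_extreme) / (n + 1)


def flag_outlier(point, reference, calibration_scores, k, dist, alpha):
    """Return (p_value, flagged); flagged is True when p_value <= alpha."""
    score = knn_score(point, reference, k, dist)
    p_value = conformal_p_value(score, calibration_scores)
    return p_value, p_value <= alpha


# Hypothetical "usual" data: 200 simulated two-dimensional Gaussian points.
rng = random.Random(0)
usual = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# Split into a reference set (used for scoring) and a calibration set
# (used to learn what scores look like for "usual" observations).
reference, calibration = usual[:100], usual[100:]
calibration_scores = [knn_score(c, reference, 5, math.dist) for c in calibration]

# One point far from the usual cloud and one point at its center.
p_far, flagged_far = flag_outlier(
    (8.0, 8.0), reference, calibration_scores, 5, math.dist, 0.05
)
p_near, flagged_near = flag_outlier(
    (0.0, 0.0), reference, calibration_scores, 5, math.dist, 0.05
)
print(f"far point:  p = {p_far:.4f}, flagged = {flagged_far}")
print(f"near point: p = {p_near:.4f}, flagged = {flagged_near}")
```

Only the `dist` argument touches the raw data, so in this sketch the Euclidean distance could be swapped for any pairwise dissimilarity, such as one derived from word-use frequencies, without changing the rest of the procedure.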
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.