In this big-data era, massive data sequences are collected in many scientific fields to study complicated phenomena over time and space, including neuroscience, epidemiology, social science, computer vision, and astronomy. Change-point analysis is a crucial early step in analyzing these sequences: it can raise an alarm when an abnormal event occurs in online data monitoring, or segment a long sequence into more homogeneous parts for follow-up studies. To accommodate modern applications, the ability to handle high-throughput data and data with complicated structures is becoming a necessity. Parametric methods usually cannot be applied in very high dimensions unless strong assumptions are made to avoid estimating a large number of nuisance parameters. This project focuses on developing non-parametric change-point detection methods that are free of strong assumptions and computationally scalable to high-dimensional and complex data. It provides students and researchers with exciting new research problems of both statistical and scientific importance, and its training component will prepare undergraduate and graduate students with an interdisciplinary education.
This project will develop a new scan-statistic framework through a novel adaptation of graph-based methods. The PI has shown that graph-based approaches scale to high-dimensional and non-Euclidean data and allow universal analytic permutation p-value approximations that are decoupled from application-specific modeling, facilitating their application to large and complicated data sets. Despite these good properties, gaps remain between the current versions of graph-based methods and many modern applications. This project aims to fill those gaps. In particular, it will (1) develop new graph-based approaches that effectively integrate information from multiple sources, which is common in many application areas such as smart homes and smart cities, and seek ways to distribute the new approaches to local centers to avoid excessive transmission of raw data in a distributed system; (2) develop treatments at the level of graph construction to deal with dependent data, which is more effective than the circular block permutation framework developed by the PI earlier; and (3) develop a new framework that provides analytic power approximations for graph-based methods, accurate for sample sizes in the hundreds and thousands even for high-dimensional and non-Euclidean data, helping researchers make better decisions in real applications. These methodological and theoretical developments will provide a better understanding of modern complicated data sequences from diverse fields, which will further advance the understanding of major scientific problems in those fields. The tools developed in this project will be distributed as open-source software packages with detailed documentation. This will enhance collaboration between the statistics community and researchers from broader scientific fields, and make data analysis procedures more transparent.
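To give a concrete sense of the graph-based scan-statistic idea described above, the following is a minimal, hypothetical sketch, not the PI's actual method. It builds a minimum spanning tree on pairwise Euclidean distances, scans candidate change points by counting tree edges that cross each split (unusually few cross edges suggest a distributional change), and assesses significance with a plain permutation p-value. The choice of MST, the trimmed scan window, and the number of permutations are all illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def edge_count_scan(X, n_perm=200, seed=0):
    """Illustrative graph-based change-point scan (sketch, not the PI's method).

    X : (n, d) array of sequential observations.
    Returns the estimated change point, the observed minimum cross-edge
    count, and a permutation p-value.
    """
    n = len(X)
    # Similarity graph: minimum spanning tree on Euclidean distances.
    D = squareform(pdist(X))
    mst = minimum_spanning_tree(D).tocoo()
    edges = list(zip(mst.row, mst.col))

    def min_cross_count(order):
        # For each candidate split t, count edges joining the two sides;
        # minimize over t (trimming the ends of the sequence).
        pos = np.empty(n, dtype=int)
        pos[order] = np.arange(n)  # time index of each observation
        best_t, best_c = None, np.inf
        for t in range(n // 4, 3 * n // 4):
            c = sum((pos[i] < t) != (pos[j] < t) for i, j in edges)
            if c < best_c:
                best_t, best_c = t, c
        return best_t, best_c

    t_hat, obs = min_cross_count(np.arange(n))
    # Permutation null: reshuffle the time order, keep the graph fixed.
    rng = np.random.default_rng(seed)
    perm = [min_cross_count(rng.permutation(n))[1] for _ in range(n_perm)]
    p_value = (1 + sum(s <= obs for s in perm)) / (1 + n_perm)
    return t_hat, obs, p_value
```

Because the statistic depends on the data only through the graph, the same code applies unchanged to any data type for which a pairwise distance can be computed, which is the scalability property the abstract emphasizes.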
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.