With the advent of the Internet, numerous applications in network traffic, search, and databases face datasets that are very large, inherently high-dimensional, or naturally streaming. To tackle these extremely large-scale practical problems (e.g., building statistical models from massive data, real-time network traffic monitoring and anomaly detection), methods based on statistics and probability have become increasingly popular. This proposal aims to develop theoretically well-grounded statistical methods for massive data based on random projections, including data stream algorithms, quantized projection algorithms, and sparse projection algorithms.
Massive data are often generated as high-rate streams; network traffic is a typical example. Measuring (and updating) network traffic statistics in real time using small storage space is crucial for detecting anomalous events such as DDoS (Distributed Denial of Service) attacks. For many applications such as databases and machine learning, appropriate quantization of random projections will substantially improve accuracy (in terms of variance per bit) and provide efficient indexing and dimension reduction to facilitate search and learning. The proposed research will tackle a series of mathematically challenging problems in the development of random projections. A wide range of statistical learning and numerical linear algebra algorithms will be re-engineered to take advantage of state-of-the-art projection methods.
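As a rough illustration of the ideas above, and not the algorithms proposed here, the following Python sketch maintains a Gaussian random projection of a vector that is only ever seen as a stream of (index, increment) updates, then quantizes the projected values to one bit each. The class name `ProjectionSketch`, the function `estimated_angle`, and the seeding scheme are hypothetical choices made for this example. It relies on two standard facts: the mean of the squared Gaussian projections is an unbiased estimate of the squared Euclidean norm, and the probability that two sign-quantized Gaussian projections disagree equals the angle between the vectors divided by pi.

```python
import math
import random


class ProjectionSketch:
    """Linear sketch y = R^T x with i.i.d. N(0, 1) entries in R,
    maintained under streaming updates x[i] += delta."""

    def __init__(self, k, seed=0):
        self.k = k            # number of projections (sketch size)
        self.seed = seed      # sketches must share a seed to be comparable
        self.y = [0.0] * k    # the only state kept: k counters

    def _row(self, i):
        # Regenerate row i of R on demand from the seed, so the full
        # projection matrix is never stored (illustrative seeding scheme).
        rng = random.Random(self.seed * 1_000_003 + i)
        return [rng.gauss(0.0, 1.0) for _ in range(self.k)]

    def update(self, i, delta):
        # Stream element: coordinate i of x changes by delta.
        row = self._row(i)
        for j in range(self.k):
            self.y[j] += delta * row[j]

    def squared_norm(self):
        # Unbiased estimate of ||x||^2 from the k projections.
        return sum(v * v for v in self.y) / self.k

    def signs(self):
        # 1-bit quantization of each projected value.
        return [1 if v >= 0 else 0 for v in self.y]


def estimated_angle(a, b):
    # Fraction of disagreeing sign bits estimates angle / pi.
    disagree = sum(1 for s, t in zip(a.signs(), b.signs()) if s != t)
    return math.pi * disagree / a.k
```

For example, feeding the updates `(0, 3.0)` and `(1, 4.0)` into a sketch with a few thousand projections should give a `squared_norm()` close to 25, and two sketches built with the same seed from orthogonal vectors should give an `estimated_angle` close to pi/2. The accuracy of both estimates improves as `k` grows, at the cost of more storage per stream.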
Many industries, such as search, urgently need statistical algorithms that can effectively handle massive data. The algorithms developed in this proposal are expected to be integrated with parallel platforms to solve truly large-scale real-world problems. Research results will be disseminated to practitioners through publications, conference presentations, industry visits and collaborations, tutorials, and open-source distributions. Many of the proposed research problems involve statistical analysis and may continue to help attract statisticians and mathematicians to work in the area of big data. The proposed research activities will engage both undergraduate and graduate students in statistics and engineering through innovative curriculum and research training.