We now live in an era of data deluge. The sheer volume of data to be processed, together with the growing complexity of statistical models and the increasingly distributed nature of data sources, poses new challenges for modern statistical theory. Standard machine learning methods can no longer meet the resulting computational requirements. They need to be redesigned or adapted, which calls for a new generation of scalable learning algorithms for massive data, along with the theory to support them. This project aims to provide a collection of state-of-the-art nonparametric learning tools for big data analysis that can be used directly by scientists and practitioners and that will benefit fields such as biomedicine, health care, defense and security, and information technology. The deliverables of this project include easy-to-use software packages that will be thoroughly evaluated on a range of application examples. They will directly help scientists explore and analyze complex data sets.
Due to storage and computational bottlenecks, traditional statistical inferential procedures originally designed for a single machine are no longer applicable to modern large datasets. This project aims to design new scalable learning algorithms for a wide range of nonparametric models. These algorithms target data that are distributed across a large number of multi-core computational nodes or, when only a single machine is available, data that are compressed by random sketching. The computational limits of these new algorithms will be examined from a statistical perspective. For example, in the divide-and-conquer setup, the number of deployed machines can be viewed as a simple proxy for computing cost. The project aims to establish a sharp upper bound for this number: when the number is below this bound, statistical optimality (in terms of nonparametric estimation or testing) is achievable; otherwise, statistical optimality becomes impossible. The analogous question for the randomized sketching method, namely the minimal number of random projections needed, will also be addressed.
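To make the divide-and-conquer setup concrete, the following is a minimal sketch, not the project's actual software. It assumes kernel ridge regression with a Gaussian kernel as the nonparametric estimator, a choice made here purely for illustration, and it simulates the machines as data blocks on a single computer. All function names, the bandwidth, the penalty `lam`, and the simulated data are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=0.2):
    """Gaussian kernel matrix between two 1-D arrays of inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))


def krr_fit_predict(x_train, y_train, x_test, lam=1e-3):
    """Kernel ridge regression on one block: solve (K + n*lam*I) alpha = y."""
    n = len(x_train)
    gram = rbf_kernel(x_train, x_train)
    alpha = np.linalg.solve(gram + n * lam * np.eye(n), y_train)
    return rbf_kernel(x_test, x_train) @ alpha


def dc_krr_predict(x, y, x_test, num_machines, lam=1e-3):
    """Split the data into blocks, fit each block separately, average the fits."""
    x_blocks = np.array_split(x, num_machines)
    y_blocks = np.array_split(y, num_machines)
    local_preds = [
        krr_fit_predict(xb, yb, x_test, lam)
        for xb, yb in zip(x_blocks, y_blocks)
    ]
    return np.mean(local_preds, axis=0)


# Simulated data (an assumption for illustration): noisy sine curve.
rng = np.random.default_rng(0)
n = 1200
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
x_test = np.linspace(0.05, 0.95, 50)
truth = np.sin(2 * np.pi * x_test)

# The number of machines acts as the proxy for computing cost.
mse_by_machines = {}
for num_machines in (1, 4, 16):
    pred = dc_krr_predict(x, y, x_test, num_machines)
    mse_by_machines[num_machines] = float(np.mean((pred - truth) ** 2))

for num_machines, mse in mse_by_machines.items():
    print(f"machines={num_machines:3d}  test MSE={mse:.5f}")
```

Each block requires solving a linear system of size n divided by the number of machines, so more machines means cheaper local fits. The sketch only lets a reader observe how the averaged estimator's error changes with the number of blocks; it does not compute or verify the sharp bound described above.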
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.