Massive data present unprecedented opportunities for advancing our understanding of various scientific and social phenomena. With sufficient data and the appropriate statistical tools, researchers can now hope to recover structures in the data that were once deemed too intricate to identify with traditional "small" data. Extracting complex hidden structures in massive data often requires flexible nonparametric methods; however, several fundamental challenges make existing nonparametric methods impractical or inadequate. At the core of these challenges is a conflict between two essential aspects of big data analysis: (i) the need for flexible methodology for capturing complex features and (ii) the cost, both computational and statistical, associated with this additional flexibility. Effective resolution of this fundamental conflict requires new paradigms of nonparametric inference. The long-term research objective of this project is to develop paradigms, including theory, methods, algorithms, and software, for nonparametric inference and learning that effectively resolve this conflict. The research will lead to the development of statistical tools that meet urgent needs for scalable nonparametric data analysis in a wide range of fields, including biology, economics, astrophysics, chemistry, and information technology. The project will integrate research with education through teaching and mentoring of undergraduate and graduate students and outreach to students from local colleges.
This project will develop and investigate a particularly promising paradigm, multi-scale divide-and-conquer, to address the fundamental conflict between flexibility and cost. Specific inference problems to be addressed cover a wide range of nonparametric inference and learning objectives, and can be organized into three research thrusts: (i) joint nonparametric modeling of multiple data-generating processes; (ii) characterizing dependency between random variables or vectors; and (iii) response-domain ensemble supervised learning. Beyond addressing these specific objectives, the proposed research will introduce theoretical and computational devices for evaluating and improving the statistical and computational efficiency of multi-scale divide-and-conquer methods in general. The output of the research will include practical methods and algorithms for carrying out a variety of important nonparametric inference tasks on massive data, as well as general guiding principles for effective multi-scale statistical analysis. The research output will be disseminated to the scientific community and society at large through publications, presentations, and open-source software.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.