Science, engineering, and industry are all being revolutionized by the modern era of data science, in which increasingly large and rich forms of data are now available. The applications are diverse and broadly significant, including data-driven discovery in astronomy, statistical machine learning approaches to drug design, and decision-making in robotics and automated driving, among many others. This grant supports research on techniques and models for learning from such massive datasets, leading to computationally efficient algorithms that can be scaled to the large problem instances encountered in practice. The PI plans to integrate research and education by involving graduate students in the research and by incorporating the research results into courses at UC Berkeley, publicly available web-based course materials, and mini-courses at summer schools and workshops. This project will also provide mentoring and support for graduate students and postdocs who are female or belong to underrepresented minority (URM) communities.
Many estimators in statistics are defined via an iterative algorithm applied to a data-dependent objective function (e.g., the EM algorithm for missing-data and latent-variable models; gradient-based methods and Newton's method for M-estimation; boosting algorithms used in non-parametric regression). This project comprises several research thrusts centered on exploiting the dynamics of these algorithms in order to answer statistical questions, with applications to statistical parameter estimation, selection of the number of components in a mixture model, and optimal bias-variance trade-offs in non-parametric regression. In more detail, the aims of this project include (i) providing a general analysis of the EM algorithm for non-regular mixture models and related singular problems, in which very slow (sub-geometric) convergence is typically observed; (ii) developing a principled method for model selection based on the convergence rate of EM, and proving theoretical guarantees on its performance; (iii) developing a general theoretical framework that combines the convergence rate of an algorithm with bounds on its (in)stability so as to establish bounds on the statistical estimation error; and (iv) providing a complete analysis of the full boosting path for various types of boosting updates, including kernel boosting as well as gradient-boosted regression trees, and analyzing the "overfitting" regime to elucidate conditions under which overfitting does or does not occur.
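The contrast between geometric and sub-geometric convergence of EM in aim (i) can be seen in a minimal numerical sketch. The model below, a symmetric two-component Gaussian mixture 0.5*N(theta, 1) + 0.5*N(-theta, 1) with only the mean parameter estimated, is an assumed illustrative example and not a model specified by the project; all settings (sample size, starting point, iteration count) are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # assumed sample size for the illustration

def em_step(theta, x):
    # E-step: soft assignment to the +theta component; the quantity
    # 2*w - 1 simplifies to tanh(theta * x) for this symmetric mixture.
    # M-step: re-estimate the mean parameter from the soft assignments.
    return np.mean(np.tanh(theta * x) * x)

def run_em(x, theta0=0.5, iters=20):
    # Run EM from theta0 and record the absolute parameter path
    theta, path = theta0, []
    for _ in range(iters):
        theta = em_step(theta, x)
        path.append(theta)
    return np.array(path)

# Regular case: well-separated components (theta* = 2); EM contracts
# geometrically and reaches the statistical error floor within a few steps.
x_reg = 2.0 * rng.choice([-1.0, 1.0], size=n) + rng.normal(size=n)
err_reg = np.abs(run_em(x_reg) - 2.0)

# Singular case: theta* = 0, so the two components coincide; the EM map
# behaves like theta - theta**3 near zero and contracts only algebraically.
x_sing = rng.normal(size=n)
err_sing = np.abs(run_em(x_sing))
```

After the same number of iterations, the error in the singular case remains far larger than in the regular case, which is the slow-convergence phenomenon that aim (i) seeks to characterize in general.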
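The boosting-path phenomenon in aim (iv) can also be sketched numerically. The example below runs L2 (kernel) boosting on simulated regression data and tracks both training error and estimation error along the path; the weak learner (a symmetrically normalized Gaussian kernel matrix), bandwidth, step size, and noise level are all assumptions made for this sketch, not specifications from the project.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.sort(rng.uniform(0.0, 1.0, n))
f_true = np.sin(2.0 * np.pi * x)      # assumed true regression function
y = f_true + rng.normal(size=n)       # noisy observations

# Weak learner: symmetrically normalized Gaussian kernel matrix,
# whose eigenvalues lie in (0, 1].
h = 0.02  # assumed bandwidth
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
d = K.sum(axis=1)
S = K / np.sqrt(np.outer(d, d))

nu = 0.5          # assumed step size (shrinkage)
F = np.zeros(n)   # boosted fit, initialized at zero
train_err, est_err = [], []
for t in range(3000):
    F += nu * (S @ (y - F))                  # kernel-boosting update
    train_err.append(np.mean((y - F) ** 2))  # error against the data
    est_err.append(np.mean((f_true - F) ** 2))  # error against the truth

t_best = int(np.argmin(est_err))  # optimal early-stopping time
```

Training error decreases monotonically along the path, while the estimation error against the true regression function follows the familiar U-shape: past `t_best` the path enters the overfitting regime where it fits noise, so early stopping acts as implicit regularization, which is the trade-off the full-boosting-path analysis addresses.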
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.