This project develops a new generation of optimization methods to address data mining and knowledge discovery challenges in large-scale scientific data analysis. The project is constructed in the context that modern computing architectures are enabling us to fit complex statistical models (Big Models) on large and complex datasets (Big Data). However, despite significant progress in each subfield of Big Data, Big Model, and modern computing architecture, we are still lacking powerful optimization techniques to effectively integrate these key components.
One important bottleneck is that many general-purpose optimization methods are not specifically designed for statistical learning problems. Even some of them are tailored to utilize specific problem structures, they have not actually incorporated sophisticated statistical thinking into algorithm design and analysis. To tackle this bottleneck, the project extends traditional theory to open new possibilities for nontraditional optimization problems, such as nonconvex and infinite-dimensional examples. The project develops deeper theoretical understanding of several challenging issues in optimization (such as nonconvexity), develops new algorithms that will lead to better practical methods in the big data era, and demonstrates the new methods on challenging bio-informatics problems.
The project is closely related to NSF's mission to promote Big Data research, and will have broad impacts. In the Big Data era, we see an urgent need for powerful optimization methods to handle the increasing complexity of modern datasets. However, we still lack adequate methods, theory, and computational techniques. By simultaneously addressing these aspects, this project will deliver novel and useful statistical optimization methods that benefit all relevant scientific areas. The project will deliver easy-to-use software packages which directly help scientists to explore and analyze complex datasets. Both PIs will also design and develop new classes to teach modern techniques in handling big data optimization problems. All the course materials - including lecture notes, problem sets, source code, solutions and working examples - will be freely accessed online. Moreover, both PIs will write tutorial papers and disseminate the results of this research through the internet, academic conferences, workshops, and journals. Through senior theses and potentially the REU (Research Experiences for Undergraduates) program, the proposed project will also actively include undergraduates and engage under-represented minority groups.
To achieve these goals, this project develops (i) a new research area named statistical optimization, which incorporates sophisticated statistical thinking into modern optimization, and will effectively bridge machine learning, statistics, optimization, and stochastic analysis; (ii) new theoretical frameworks and computational methods for nonconvex and infinite-dimensional optimization, which will motivate effective optimization methods with theoretical guarantees that are applicable to a wide variety of prominent statistical models; (iii) new scalable optimization methods, which aim at fully harnessing the horsepower of modern large-scale distributed computing infrastructure. The project will shed new theoretical light on large-scale optimization, advance practice through novel algorithms and software, and demonstrate the methods on challenging bio-informatics problems.