Modern scientific disciplines increasingly face datasets of ever-larger size and complexity. Experimental observations may be marred by inaccurate measurements and missing values, and the sheer volume of output from modern high-throughput experimental procedures in the life sciences makes data processing a growing challenge. Drawing accurate scientific inferences from such data requires developing new tools that are both theoretically sound and computationally efficient. This project aims to develop statistical methodologies for uncovering the intrinsic structure in large, complex data. The planned methods have the potential to become the default data science techniques used in many scientific and engineering disciplines. Fast, user-friendly software will be made publicly available, both for general-purpose big data analysis and for specific scientific applications.

The first pillar of the planned methodology is principal component analysis (PCA). The investigators are extending the use of PCA to high-dimensional data with corrupted observations, non-Gaussian noise, and low signal-to-noise ratios. Datasets of this kind arise in problems such as cryo-electron microscopy and X-ray free-electron laser imaging. This work will provide robust tools for exploratory data analysis in these problems. The second pillar of the research program is the method of moments, a classical technique for parameter estimation that the investigators have repurposed for new problems. The investigators will extend the range of applicability of the method of moments to many big data problems that exhibit certain algebraic structure. For these problems, the method of moments enables scalable and near-optimal statistical inference. Finally, the novel extensions of PCA and the method of moments will be combined to derive new near-optimal and scalable statistical inference procedures for high-dimensional problems.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1837992
Program Officer: Victor Roytburd
Budget Start: 2018-10-01
Budget End: 2021-09-30
Fiscal Year: 2018
Total Cost: $1,000,000
Name: Princeton University
City: Princeton
State: NJ
Country: United States
Zip Code: 08544