Extraordinary amounts of data are collected in many branches of science, in industry, and in government. The massive amounts of data provide incredible opportunities for making knowledge-based decisions and for advancing complicated research problems through data-driven discoveries. To capitalize on these opportunities, it is critical to develop methodology that facilitates the extraction of useful information from massive data in a computationally efficient way. Even the simplest analyses of the data can be computationally intensive or may no longer be feasible for big data. It is however often the case that valid conclusions can be drawn by considering only some of the data, referred to as subdata. This project develops optimal strategies for selecting subdata that retain, as much as possible, relevant information that was available in the massive data set. The methodology helps to identify the most informative data points, after which an analysis can proceed based on the selected subdata only. This facilitates data-driven decisions, scientific discoveries, and technological breakthroughs with computing resources that are readily available.

Existing investigations for extracting information from big data with common computing power have focused on random subsampling-based approaches, which have as limitation that the amount of information extracted is only scalable to the subdata size, not the full data size. This project develops and expands the Information-Based Optimal Subdata Selection (IBOSS) method proposed by the PIs in the following directions: 1) It combines IBOSS with sparse variable selection methods in linear regression; 2) it develops subdata selection methods for generalized linear models; 3) it constructs computationally efficient algorithms for selecting the most informative subdata; and 4) it develops user-friendly software that supports the methodology. The research is a significant addition to the field of big data science. It advances a new method for dealing with big data and has the potential to create novel research opportunities in statistical science and other quantitative fields. The results are valuable even when supercomputers are available, because cutting edge high performance computing facilities will always trail the exponential growth of data volume.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Type
Standard Grant (Standard)
Application #
1811291
Program Officer
Gabor Szekely
Project Start
Project End
Budget Start
2018-07-15
Budget End
2021-06-30
Support Year
Fiscal Year
2018
Total Cost
$60,000
Indirect Cost
Name
University of Illinois at Chicago
Department
Type
DUNS #
City
Chicago
State
IL
Country
United States
Zip Code
60612