In the modern era truly enormous amounts of data are constantly being generated across a wide range of domains: these include ongoing large-scale scientific experiments, ubiquitous smartphones and sensors, the continuous production and evolution of content on social media, and many others. How can this flood of data be efficiently processed and analyzed? A branch of computer science called "property testing" seeks to develop ultra-fast algorithms for analyzing massive data sets to quickly determine whether or not the data has some property of interest. However, the standard theoretical models that have mostly been considered in property testing are not well suited to many real-world data analysis scenarios; these standard models prioritize mathematical elegance, but the resulting assumptions they make do not align well with the abilities of actual data analysis algorithms or with the nature of many actual data sets. (As one example, these models typically assume that a data analysis algorithm can synthesize arbitrary data points and query them to receive accurate information about how such data points should be labeled, but such queries are impossible in many real-world settings where data points "come as they are" and cannot be synthesized to meet the specifications of a data analyst. As another example, these models typically can only deal with data which is assumed to follow certain highly structured probability distributions, but real-world data is messy and rarely possesses such a high degree of structure.) The high-level goal of this project is to develop and analyze non-standard models of property testing, with the explicit goal of developing algorithms which align with the realities and constraints of real-world data analysis problems. An important related goal is to foster human resource development by performing outreach and training graduate students, including members of historically under-represented groups, in the analytic and algorithmic techniques that are central to this project. Planned activities to achieve broader impacts also include new courses, survey articles, and the continuation of outreach activities aimed at students at the elementary and middle school levels.

In more detail, the project will focus on three different aspects of property testing algorithms for big data, all of which are motivated by considerations arising from real-world data analysis:(1) The first focus of the project will be on developing flexible algorithms for testing whether a massive high-dimensional data set has been labeled according to a "junta" --- this is a labeling rule which depends only on a very small but unknown set of data features out of a huge set of possible features. Building on their previous work, the investigators will work to develop junta testing algorithms which can handle arbitrary data distributions and noisy data, and can succeed even given only a limited form of access to the data set being analyzed. (2) The second focus of the project will be on transferring ideas and techniques from theoretical machine learning algorithms to the domain of property testing of massive data sets. Previous work of the investigators gave a proof-of-concept for how (certain relatively inefficient) machine learning algorithms can be modified to yield far more efficient property testing algorithms for data analysis, but this transfer went through only in the relatively constrained standard models of property testing, alluded to above, which assume highly structured data distributions. In this project the investigators will work to extend these earlier results so that the machine learning techniques will yield algorithms for more flexible property testing models that are of greater real-world applicability.(3) Finally, the third focus of the project is to develop property testing algorithms which do not need to make queries on synthetic data points but instead use only random samples, and which can be applied to high-dimensional continuous data sets. Data of this type arises commonly in settings where sensors or measurements of different sorts are generating the data, but most property testing algorithms are designed for discrete binary-valued data rather than continuous data.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Type
Standard Grant (Standard)
Application #
1838154
Program Officer
Sylvia Spengler
Project Start
Project End
Budget Start
2019-01-01
Budget End
2021-12-31
Support Year
Fiscal Year
2018
Total Cost
$910,000
Indirect Cost
Name
Columbia University
Department
Type
DUNS #
City
New York
State
NY
Country
United States
Zip Code
10027