Data-driven design is leading to unprecedented performance improvements in many widely used systems. Examples of recent successes can be found in speech recognition, advanced video analysis, and imaging-based medical diagnosis, to name just a few. This project is motivated by the observation that more data does not always lead to better system design. In fact, extensive use of poorly understood data can create significant risks once systems are deployed. For example, data may introduce bias toward specific system outputs (e.g., lead to incorrect diagnoses), or performance might degrade significantly under even small changes in data collection (e.g., microphone characteristics, camera resolution). These risks are a major obstacle to wider adoption of data-driven tools, in particular in critical applications. This project develops methods to select data for improved system design, based on new models for large-scale datasets. The ultimate goal of the project is to reduce deployment risk by designing systems based on the most representative dataset rather than simply using the largest dataset.

In many applications, such as sensing, anomaly detection, classification, recognition, or identification, systems are designed by first collecting significant amounts of data and then optimizing system parameters using that data. As task complexity, data size, and the number of system parameters increase, system analysis and characterization become a major challenge, with estimates often based on end-to-end performance on the training set. Examples of these characterization tasks include (i) estimating system accuracy, (ii) characterizing system stability under changes in data, (iii) determining the correct amount of data needed for training, and (iv) predicting a system's ability to generalize to different situations. In this project, graph-based approaches are developed to characterize large datasets in high-dimensional space. This research is focused on theoretical, algorithmic, and practical aspects of system characterization and design. On the theoretical front, this project tackles the problem of designing graphs that capture relevant properties of the data space, developing asymptotic results that link the distribution of the data to properties of graphs and related graph signals. On the algorithmic front, efficient methods for graph construction and task complexity estimation are developed, with the goal of enabling selection of the most representative dataset. As an application, practical deep learning architectures are considered, methods to increase their robustness are studied, and new strategies for active and transfer learning are developed.
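As an illustration of the kind of graph-based characterization described above, the sketch below builds a k-nearest-neighbor similarity graph over a small point set and measures how smoothly class labels vary across its edges (the quadratic form x^T L x of graph signal processing). This is a minimal, hypothetical example, not the project's actual method: the function names, the Gaussian edge weights, and the use of label smoothness as a rough proxy for task complexity are all illustrative assumptions.

```python
import math

def knn_graph(points, k=2, sigma=1.0):
    """Build a symmetric k-nearest-neighbor similarity graph.

    Edge weights use a Gaussian kernel on Euclidean distance (an
    illustrative choice). Returns an adjacency dict {i: {j: weight}}.
    """
    n = len(points)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    adj = {i: {} for i in range(n)}
    for i in range(n):
        # Sort all other nodes by distance and keep the k nearest.
        nearest = sorted((dist(points[i], points[j]), j)
                         for j in range(n) if j != i)[:k]
        for dij, j in nearest:
            w = math.exp(-dij ** 2 / (2 * sigma ** 2))
            adj[i][j] = w
            adj[j][i] = w  # symmetrize so the graph is undirected
    return adj

def label_smoothness(adj, labels):
    """Graph-signal smoothness: (1/2) * sum_ij w_ij (x_i - x_j)^2.

    Low values mean labels vary slowly over the graph, a rough
    indicator of an 'easier' classification task on this dataset.
    """
    s = 0.0
    for i, nbrs in adj.items():
        for j, w in nbrs.items():
            s += w * (labels[i] - labels[j]) ** 2
    return s / 2.0  # each undirected edge was counted twice

# Two well-separated clusters of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
easy = [0, 0, 0, 1, 1, 1]  # labels consistent within each cluster
hard = [0, 1, 0, 1, 0, 1]  # labels alternate inside each cluster

adj = knn_graph(points)
print(label_smoothness(adj, easy))  # 0.0: no edge crosses a label boundary
print(label_smoothness(adj, hard))  # positive: labels disagree along edges
```

With k=2 and well-separated clusters, every k-NN edge stays inside a cluster, so the consistent labeling has zero smoothness while the alternating one does not; the same quantity computed on a real dataset gives one simple way to compare how "hard" different labelings or data subsets are.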

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start:
Project End:
Budget Start: 2020-07-01
Budget End: 2023-06-30
Support Year:
Fiscal Year: 2020
Total Cost: $500,000
Indirect Cost:
Name: University of Southern California
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089