Advances in sensing and data acquisition technologies make it easy to generate vast quantities of data that must be stored, communicated, processed, and understood. This data may be of variable quality and the nature of the data may vary over time -- this variability can cause difficulties for existing approaches to efficiently represent the data. Current methods use economical representations of the data in terms of a smaller number of properties, or features, of the raw data. Standard feature representations such as Fourier and wavelet representations may not be efficient at representing the data from these new acquisition technologies. One paradigm to overcome this mismatch is the data-driven approach, in which an algorithm processes the data to learn a novel and efficient feature representation for the given data. While these are more useful, such approaches may not scale well to massive data sets.

This work designs new methods for data-driven feature learning that are scalable and robust to noisy, time-varying data. It proposes an "active screening" approach to learning new features; the data processing algorithm uses a low-complexity criterion to screen for useful and informative points for feature learning. Advantages of active screening include a reduction in the computational and storage overhead as well as the ability to reject outliers or other spurious and misleading data. The investigators develop active screening methods for consistent estimation under generative models for the data, analyze the tradeoff between representation and classification in active screening for discriminative dictionary learning, and extend the active screening analysis to distributed settings for distributed dictionary learning. They investigate the promise of these methods on two large-scale electrocardiography (ECG) datasets of 170+ patients. This work combines ideas from statistics (feature screening) and machine learning (active learning and selective sampling) to design efficient representations of complex signals from massive data sets and may inform the design of new data acquisition technologies by incorporating screening ideas into the technologies themselves.

Project Start
Project End
Budget Start
2015-09-01
Budget End
2017-08-31
Support Year
Fiscal Year
2015
Total Cost
$160,000
Indirect Cost
Name
Rutgers University
Department
Type
DUNS #
City
Piscataway
State
NJ
Country
United States
Zip Code
08854