Although existing machine learning (ML) methods are effective for analyzing data that are static in nature, many of today's applications from information retrieval to web searching require real-time processing of massive amounts of streaming data. Often, the data is available in an online fashion, meaning the entire dataset is not available at once, but rather, individual data instances arrive sequentially. Area under the Receiver Operating Characteristic Curve (AUC) has been proven to be an effective evaluation metric and learning objective in many application domains, but it is difficult to optimize directly. To date, there are no satisfactory approaches to incorporate AUC into online ML algorithms for classification and ranking problems. This project will address the fundamental theoretical and algorithmic challenges in ML algorithms based on AUC maximization for processing streaming, high-dimensional data with efficient algorithms and theoretical analysis. The results of this project are expected to have broader impact in intrusion detection for cyber-security, fault detection in safety critical systems, information retrieval and cancer diagnosis. The planned research will also integrate with educational activities, including developing new undergraduate/graduate courses on optimization and machine learning, organizing a workshop and making software tools freely available to the public. The principal investigators also plan to undertake outreach activities to improve STEM learning of high school students.

A central topic of this project is to develop efficient online learning algorithms for AUC maximization and bipartite ranking, making them amenable for online processing of high dimensional and large volume of streaming data. This project also establishes a rigorous statistical foundation and thorough theoretical analysis of the online AUC maximization algorithms developed. The primary technical challenge in developing online AUC maximization algorithms and theory is that the objective functions involve statistically dependent pairs of instances while individual instances continuously arrive in a sequential order. This is in contrast to standard classification based on accuracy, where the loss functions only depend on individual instances and the related algorithms and theory are well developed. The project will address this gap by sufficiently exploiting the structures of the population objective involved in the problem of AUC maximization, rather than an empirical objective on finite data, through novel interaction of machine learning, statistical theory and applied mathematics. This will help to remove the pairwise structure in the original objective function and facilitate the development of efficient online learning algorithms for AUC maximization.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start
Project End
Budget Start
2018-08-15
Budget End
2022-07-31
Support Year
Fiscal Year
2018
Total Cost
$498,333
Indirect Cost
Name
Suny at Albany
Department
Type
DUNS #
City
Albany
State
NY
Country
United States
Zip Code
12222