Machine learning (ML) is one of the fastest growing research areas within computer science and is enabling a multitude of new applications that span medicine, the sciences, engineering, business, and more. A cornerstone of the steady progress of machine learning over the past 30 years has been the regular, systematic empirical comparison of different ML algorithms on testbed ML datasets. However, as the number of ML research papers, algorithms, and datasets continues to grow rapidly, ML researchers are faced with information overload, making it nearly impossible to keep track of the latest performance advancements in a systematic manner. In turn, this creates significant inefficiencies for researchers, slowing the pace of advances in new ML research and applications. This project will address these issues by building upon the success of the existing University of California, Irvine (UCI) Machine Learning Repository, a well-known and widely used online public repository of ML testbed datasets that ML researchers use to evaluate and track progress in ML algorithm development. This project will extend the Repository with additional information about the datasets and with the steps needed to replicate published results; both can significantly amplify the productivity of ML researchers. The project will do so by harnessing significant ML community engagement and outreach in its design and operation, and will also provide training to students from different backgrounds on how to engage with, and contribute to, machine learning research. Given that ML algorithms are now being applied to prediction problems in areas as diverse as climate science, judicial decisions, and personalized medicine, the advances in scientific reproducibility from this project, in terms of systematic evaluation of ML algorithms, have potentially far-reaching societal and scientific benefits.

The existing UCI Machine Learning Repository directly impacts tens of thousands of ML researchers and students by providing a standard and widely cited set of testbed datasets to support both research and education. This project will involve building the next version of the Repository, which will provide rich metadata for ML datasets, link datasets to research papers, and automatically extract metadata and performance data for presentation in leaderboard style. The new Repository will also provide systematic support for reproducible science by allowing users to readily validate empirical ML results on testbed datasets. This project will lead to research advances in two areas. The first will be the development of new methods and algorithms for extracting metadata from the scientific literature. The second will result from improvements in the way ML researchers carry out their experimental work. Providing tools to support broader, more systematic, and more reproducible evaluations of ML algorithms will lead to ML advances that are more robust, better calibrated, and more likely to operate well when used in real-world environments.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1925741
Program Officer: Wendy Nilsen
Project Start:
Project End:
Budget Start: 2019-10-01
Budget End: 2022-09-30
Support Year:
Fiscal Year: 2019
Total Cost: $1,792,952
Indirect Cost:
Name: University of California Irvine
Department:
Type:
DUNS #:
City: Irvine
State: CA
Country: United States
Zip Code: 92697