Machine learning (ML) algorithms have become ubiquitous across applications as diverse as science, engineering, business, finance, education and healthcare. However, developing ML software that scales to massive datasets and is also easy to use remains a challenge, in part because building an ML tool currently requires implementing a deep software stack, from the runtime (i.e., how an ML algorithm is executed) to the API exposed to users.

This project aims to develop DeML, a system to support the authoring and execution of ML tools. Specifically, DeML would allow ML algorithms to be formulated as declarative queries over the training dataset. DeML would optimize the execution of each query on a computing platform (e.g., Amazon EC2 or SQL Azure), taking into account the characteristics of the algorithm, the data, and the available computational resources. Adoption of DeML would greatly reduce the effort required to develop scalable implementations of ML algorithms. The project is organized around three thrusts: (i) development of a declarative query language based on extensions of Datalog; (ii) analysis of the runtime of DeML queries; (iii) optimization of the dataflow of DeML queries based on the characteristics of the data sources and the capabilities of the underlying execution platform. The resulting open source DeML prototype implementation will be made freely available to the community through the project web page at: http://deml.cs.ucla.edu.
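To make the idea of "an ML algorithm as a declarative query over the training dataset" more concrete, the following is a minimal Python sketch, not DeML syntax (the abstract does not show any). It assumes a linear regression model trained by batch gradient descent, where each step is an aggregate (a SUM and a COUNT) over the training rows. The names `gradient_query` and `fit`, the toy dataset, and the learning rate are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

# A training row is (features, label). This layout is an assumption.
Row = Tuple[Sequence[float], float]


def gradient_query(rows: Sequence[Row], w: Sequence[float]) -> List[float]:
    """Mean squared-error gradient, written as an aggregate over rows.

    Conceptually: SELECT SUM((w . x - y) * x) / COUNT(*) FROM rows.
    Requires at least one row.
    """
    grad = [0.0] * len(w)
    count = 0
    for x, y in rows:
        # Per-row expression: prediction error for this tuple.
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        # Accumulate the SUM aggregate, one component per feature.
        for j, xj in enumerate(x):
            grad[j] += err * xj
        count += 1
    return [g / count for g in grad]


def fit(rows: Sequence[Row], dim: int, lr: float = 0.1,
        iters: int = 2000) -> List[float]:
    """Repeat the aggregate query, updating the model after each pass."""
    w = [0.0] * dim
    for _ in range(iters):
        g = gradient_query(rows, w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


# Toy data generated from y = 1 + 2 * x, with a constant bias feature.
rows: List[Row] = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w = fit(rows, dim=2)
```

Because the per-step computation is a plain aggregate, a system in the spirit described above would be free to decide how to evaluate it (for example, by partitioning rows across machines and combining partial sums) without changes to the algorithm's specification. That is the kind of separation the abstract proposes; how DeML itself would express or optimize this is not stated in the source.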

The availability of DeML could greatly lower the effort needed to author scalable implementations of ML algorithms for the analysis of massive datasets, which in turn would make such tools more widely available to the broader community. Experience gained by implementing and deploying ML algorithms at scale on modern cloud computing platforms could help inform critical design choices in the development of future cloud computing platforms for big data analytics, and hence impact a broad range of scientific, engineering, national security, healthcare and business applications of big data analytics. The project offers enhanced opportunities for research-based advanced training in databases, machine learning, and cloud computing for graduate and undergraduate students, including members of groups that are currently under-represented in computer science.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1302698
Program Officer: Sylvia Spengler
Budget Start: 2013-09-01
Budget End: 2017-08-31
Fiscal Year: 2013
Total Cost: $667,000
Name: University of California Los Angeles
City: Los Angeles
State: CA
Country: United States
Zip Code: 90095