With the growth of the Web and improvements in scientific data collection technology, datasets have been rapidly increasing in size and complexity, necessitating comparable scaling of machine learning algorithms. However, designing and implementing efficient parallel machine learning algorithms is challenging and time consuming. To address this challenge, we recently released GraphLab, a framework providing an expressive and efficient high-level abstraction that satisfies the needs of a broad range of machine learning algorithms. The performance of our system has attracted significant attention, with thousands of downloads by universities and companies.
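To illustrate the style of abstraction described above, the sketch below expresses a computation (a simplified PageRank) as an update function applied to a vertex and its neighborhood, with a sequential loop standing in for a parallel scheduler. This is a minimal illustration of the programming model, not GraphLab's actual API; the names `Graph`, `pagerank_update`, and `run` are hypothetical.

```python
# Hypothetical sketch of a vertex-update programming model.
# Names and signatures are illustrative, not GraphLab's real API.

class Graph:
    def __init__(self, edges, num_vertices):
        self.num_vertices = num_vertices
        self.in_neighbors = {v: [] for v in range(num_vertices)}
        self.out_degree = {v: 0 for v in range(num_vertices)}
        for src, dst in edges:
            self.in_neighbors[dst].append(src)
            self.out_degree[src] += 1
        # Per-vertex data: start with a uniform rank.
        self.data = {v: 1.0 / num_vertices for v in range(num_vertices)}

def pagerank_update(graph, v, damping=0.85):
    """Recompute v's rank from its in-neighbors' current ranks."""
    total = sum(graph.data[u] / graph.out_degree[u]
                for u in graph.in_neighbors[v])
    graph.data[v] = (1 - damping) / graph.num_vertices + damping * total

def run(graph, update, iterations=50):
    # Stand-in for the runtime's scheduler: sweep all vertices repeatedly.
    for _ in range(iterations):
        for v in range(graph.num_vertices):
            update(graph, v)

g = Graph(edges=[(0, 1), (1, 2), (2, 0), (0, 2)], num_vertices=3)
run(g, pagerank_update)
print({v: round(r, 3) for v, r in g.data.items()})
```

The user supplies only the per-vertex update; the framework decides when and where each update runs, which is what makes parallel and distributed execution possible without changing the algorithm's code.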
Currently, GraphLab addresses only batch processing in multicore settings. In this project, we are developing GraphLab 2, which targets the much more challenging online and distributed settings, tackling: (1) cloud-based distributed machine learning; (2) natural graphs, whose very high-degree vertices are not amenable to standard graph-partitioning methods; (3) online tasks, where data and queries stream in over time; and (4) out-of-core computation, since huge problems may not fit in memory, even across the cloud.
A key contribution of the project is the continued dissemination and transfer of our technology. Our open-source software releases will continue to enable large-scale machine learning applications in science and engineering.
Our ambitious broader impact goals, beyond theory and systems, include the development of a new curriculum focused on preparing students for the industrial and scientific needs in this field. Our proposed courses include "Machine Learning on the Web" and "Cloud Computing for Big Machine Learning and Data Mining."