This SGER project seeks to determine the scalability of computationally intensive, iterative statistical learning algorithms on a MapReduce architecture. Such algorithms underlie much research in natural language processing, yet their scalability to even moderately large training datasets (text corpora) has been under-explored. On the surface, scaling to more data appears to be a good fit for the MapReduce paradigm, and this exploratory project aims to determine whether such algorithms benefit from more data, and from more complex data, than were used in prior work. Special emphasis is given to unsupervised learning algorithms, such as the Expectation-Maximization algorithm, which have been widely studied on small problems and rarely studied on large ones. The same approach is applicable to many other methods as well.

At the same time, the project seeks to explore how to leverage supercomputers and MapReduce to make these learning algorithms faster, permitting a faster research cycle. Concretely, the "E step" (or its analogue) is the most computationally demanding part of an iteration, but the standard assumption that the training data are independently and identically distributed permits parallelization: each training example's contribution can be computed separately and the results summed. To the extent that the gains from this parallelization are not offset by network and input-output overhead, each iteration of training may be made faster, perhaps reducing training time from days or weeks to hours. This project explores this tradeoff and others like it.
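As an illustration of why the i.i.d. assumption makes the E step parallelizable, the following is a minimal sketch, not the project's actual code. It assumes a one-dimensional mixture of unit-variance Gaussians and uses in-process map and reduce calls as a stand-in for Hadoop. The function names (`e_step_map`, `combine`, `m_step`, `em`) are hypothetical and chosen only for this example.

```python
import math
from functools import reduce


def e_step_map(shard, weights, means):
    """Map: expected sufficient statistics for one shard of the data.

    Assumes unit-variance Gaussian components (an assumption of this sketch).
    """
    k = len(means)
    counts = [0.0] * k  # sum of responsibilities per component
    sums = [0.0] * k    # responsibility-weighted sum of x per component
    loglik = 0.0
    for x in shard:
        joint = [
            weights[j] * math.exp(-0.5 * (x - means[j]) ** 2) / math.sqrt(2 * math.pi)
            for j in range(k)
        ]
        total = sum(joint)
        loglik += math.log(total)
        for j in range(k):
            r = joint[j] / total  # posterior responsibility of component j for x
            counts[j] += r
            sums[j] += r * x
    return counts, sums, loglik


def combine(a, b):
    """Reduce: statistics from different shards simply add up."""
    return (
        [x + y for x, y in zip(a[0], b[0])],
        [x + y for x, y in zip(a[1], b[1])],
        a[2] + b[2],
    )


def m_step(stats):
    """Re-estimate mixing weights and means from the aggregated statistics."""
    counts, sums, _ = stats
    n = sum(counts)
    return [c / n for c in counts], [s / c for s, c in zip(sums, counts)]


def em(shards, weights, means, iterations=20):
    """Run EM; each iteration is one map pass over the shards plus one reduce."""
    history = []  # log-likelihood under the parameters used in each E step
    for _ in range(iterations):
        partials = [e_step_map(shard, weights, means) for shard in shards]
        stats = reduce(combine, partials)
        history.append(stats[2])
        weights, means = m_step(stats)
    return weights, means, history
```

Because the per-shard statistics are plain sums, the result does not depend on how the data are split across shards, up to floating-point rounding. On a real cluster, the cost of shipping the parameters to every mapper and the partial statistics back to the reducer in each iteration is the overhead that the tradeoff above refers to.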

This work leverages a resource donated by Yahoo for use by the PI's research group: a 4,000-node supercomputer running Hadoop (an open-source implementation of MapReduce).

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0836431
Program Officer: Tatiana D. Korelsky
Project Start:
Project End:
Budget Start: 2008-07-01
Budget End: 2009-12-31
Support Year:
Fiscal Year: 2008
Total Cost: $212,721
Indirect Cost:

Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213