The ability of organizations to process enormous quantities of data and to extract insights from those data has revolutionized commerce and science. This phenomenon, known as "big data", is shaping the very fabric of our society. Our insatiable appetite for more data, and for knowledge from the data, requires significant computational infrastructure for storage and analysis. Continued investments in infrastructure for academic researchers are vital from two perspectives. From the research perspective, the university's ability to help advance the state of the art in big data technologies depends on access to the right computational resources. From the educational perspective, the university's mission to train the next generation of scientists and engineers cannot be accomplished without the big data infrastructure that is becoming essential to their careers. The goal of this project is to provide computational resources to researchers at the University of Maryland so that they can continue envisioning the future of big data.

The modern empirical approach to tackling many challenges in natural language processing, information retrieval, data mining, machine learning, and other related domains involves exploiting large amounts of data to learn statistical models that are able to capture characteristics of the problem. A necessary ingredient to this "big data" approach is scalable infrastructure that can distribute computations across a cluster of machines. Hadoop, the open-source implementation of MapReduce, has achieved widespread adoption as the de facto platform for data-intensive computing.

Broadly speaking, MapReduce excels at large-scale content analysis in an offline, batch setting. However, this is not enough: we need a data-intensive computing platform that supports heterogeneous models of computation. Hadoop NextGen (aka YARN) provides exactly this: it allows a physical cluster to support a wide range of computational models via a generic resource allocation framework.

This project supports the acquisition of a Hadoop NextGen cluster at the University of Maryland to support the following activities:

1. To explore computational models beyond MapReduce, including batch/online tradeoffs in machine learning, real-time streaming computations, and graph processing.

2. To sustain innovations in algorithms for content analysis as well as modeling implicit and latent relationships between heterogeneous content (text, images, graphs, etc.) at scale.

3. To exploit novel hardware architectures for data-intensive computing (e.g., Graphics Processing Units and Solid State Drives).

These resources will help the Laboratory for Computational Linguistics and Information Processing (CLIP) and collaborators at the University of Maryland sustain and enhance their successful record of innovation and of integrating research and education.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1405688
Program Officer: Aidong Zhang
Project Start:
Project End:
Budget Start: 2014-08-01
Budget End: 2017-07-31
Support Year:
Fiscal Year: 2014
Total Cost: $499,852
Indirect Cost:
Name: University of Maryland College Park
Department:
Type:
DUNS #:
City: College Park
State: MD
Country: United States
Zip Code: 20742