II-EN: Hadoop NextGen Infrastructure for Heterogeneous Approaches to Data-Intensive Computing

Lin, Jimmy

Abstract

The ability of organizations to process enormous quantities of data and to extract insights from those data has revolutionized commerce and science. This phenomenon, known as "big data", is shaping the very fabric of our society. Our insatiable appetite for more data, and for knowledge from the data, requires significant computational infrastructure for storage and analysis. Continued investments in infrastructure for academic researchers are vital from two perspectives. From the research perspective, the university's ability to help advance the state of the art in big data technologies depends on access to the right computational resources. From the educational perspective, the university's mission to train the next generation of scientists and engineers cannot be accomplished without the big data infrastructure that is becoming essential to their careers. The goal of this project is to provide computational resources to researchers at the University of Maryland so that they can continue envisioning the future of big data.

The modern empirical approach to tackling many challenges in natural language processing, information retrieval, data mining, machine learning, and other related domains involves exploiting large amounts of data to learn statistical models that are able to capture characteristics of the problem. A necessary ingredient to this "big data" approach is scalable infrastructure that can distribute computations across a cluster of machines. Hadoop, the open-source implementation of MapReduce, has achieved widespread adoption as the de facto platform for data-intensive computing.

Broadly speaking, MapReduce excels at large-scale content analysis in an offline, batch setting. However, this is not enough: we need a data-intensive computing platform that supports heterogeneous models of computation. Hadoop NextGen (also known as YARN) provides exactly this: it allows a physical cluster to support a wide range of computational models via a generic resource allocation framework.

This project supports the acquisition of a Hadoop NextGen cluster at the University of Maryland to support the following activities:

1. To explore computational models beyond MapReduce, including batch/online tradeoffs in machine learning, real-time streaming computations, and graph processing.

2. To sustain innovations in algorithms for content analysis as well as modeling implicit and latent relationships between heterogeneous content (text, images, graphs, etc.) at scale.

3. To exploit novel hardware architectures for data-intensive computing (e.g., Graphics Processing Units and Solid State Drives).

These resources will help the Laboratory for Computational Linguistics and Information Processing (CLIP) and its collaborators at the University of Maryland sustain and enhance their successful record of innovation and of integrating research and education.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1405688
Program Officer: Aidong Zhang

Budget Start: 2014-08-01
Budget End: 2017-07-31
Fiscal Year: 2014
Total Cost: $499,852

Institution

University of Maryland College Park, College Park, MD, United States
