The Web promises unprecedented access to the perspectives of an enormous number of people on a wide range of issues. Turning that still-untamed cacophony into meaningful insights requires dealing with the linguistic diversity and sheer scale of the Web. Most current research focuses on specialized tasks such as tracking consumer opinions, and virtually all of it treats the Web as both monolithic and monolingual, ignoring the variety of languages represented and the rich interplay between the topics and issues under discussion.
This project moves the state of the art forward by focusing on two key challenges. First, it develops highly scalable MapReduce algorithms for linguistic modeling within a Bayesian framework, using variational inference to achieve a high degree of parallelization on Web-scale datasets. Second, it introduces novel Bayesian models that learn consistent interpretations of text across languages and across a wide range of response variables of interest (for example, views on an issue, strength of emotion about an event, and focus of attention).
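The abstract does not spell out the algorithmic details, but the parallelization pattern it describes, where variational inference over individual documents becomes a "map" step and aggregation of sufficient statistics becomes a "reduce" step, can be sketched for an LDA-style topic model as follows. This is a toy illustration under our own assumptions (the function names, toy corpus, and hyperparameter values are invented for the example), not the project's actual implementation:

```python
# Sketch: map/reduce decomposition of variational EM for an LDA-style
# topic model. Per-document work is independent (the "map"); the only
# cross-document communication is summing statistics (the "reduce").
import numpy as np
from scipy.special import digamma, logsumexp

def e_step_map(doc_counts, log_beta, alpha=0.1, n_iter=50):
    """'Map' task: fit variational parameters for one document and emit
    its expected topic-word counts (the sufficient statistics)."""
    K, V = log_beta.shape
    words = np.nonzero(doc_counts)[0]            # word ids present in this doc
    counts = doc_counts[words].astype(float)
    gamma = np.full(K, alpha + counts.sum() / K)  # doc-topic variational param
    for _ in range(n_iter):
        # phi[k, i] is proportional to beta[k, w_i] * exp(E[log theta_k])
        elog_theta = digamma(gamma) - digamma(gamma.sum())
        log_phi = log_beta[:, words] + elog_theta[:, None]
        log_phi -= logsumexp(log_phi, axis=0)     # normalize over topics
        phi = np.exp(log_phi)
        gamma = alpha + phi @ counts              # update doc-topic param
    stats = np.zeros((K, V))
    stats[:, words] = phi * counts                # expected (topic, word) counts
    return stats

def m_step_reduce(stats_list, eta=0.01):
    """'Reduce' task: sum per-document statistics and re-estimate topics."""
    total = sum(stats_list) + eta                 # smoothed expected counts
    return np.log(total / total.sum(axis=1, keepdims=True))

# Toy run: 4 documents over a 4-word vocabulary, 2 topics.
X = np.array([[4, 3, 0, 0], [5, 2, 0, 1], [0, 0, 6, 3], [1, 0, 4, 5]])
rng = np.random.default_rng(0)
log_beta = np.log(rng.dirichlet(np.ones(X.shape[1]), size=2))
for _ in range(20):                               # variational EM iterations
    stats = [e_step_map(d, log_beta) for d in X]  # embarrassingly parallel
    log_beta = m_step_reduce(stats)
```

In a real MapReduce deployment, the list comprehension in the driver loop would be distributed across mappers, with reducers summing the per-document statistics; the structure of the computation is otherwise the same.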
The techniques developed in this project will be demonstrated on large crawls of Web pages and blogs. Potential applications for these technologies include helping a schoolchild learn that people in different countries may view some issues very differently, helping a politician understand how constituents are reacting to proposed legislation, or helping an intelligence analyst understand how public opinion is evolving in a hostile country.
For further information, see the project Web page: www.umiacs.umd.edu/~jimmylin/cloud-computing
Dealing with "big data", text documents arriving as a deluge of e-mail, news articles, and scientific research, is a challenge for computer scientists and everyday users alike. While there are techniques that let machines quickly summarize large collections of data, they are often limited to a single language, ignore valuable information (such as who authored a document), and don't allow users to provide feedback. Our project addresses these issues by creating scalable analysis techniques that span multiple languages, reflect important metadata, and allow users to correct problems with machine learning output.

Our research primarily uses "topic models": computer-based models that can, for example, look at a decade's worth of newspaper articles, automatically induce categories that correspond to "sports" and "politics", and automatically figure out which articles are predominantly about sports, which are predominantly about politics, and which are mixtures of the two.

The first challenge we addressed was scaling topic modeling techniques to large amounts of data, taking advantage of large clusters of servers to distribute the computations necessary to build these models. We then extended topic models to multiple languages, in ways that improve our ability to detect when people express opinions and that help machine translation systems choose context-appropriate translations (e.g., the word "race" calls for a different translation in a sports context than in a political one). Beyond spanning multiple languages, we also improved topic models within a single language by incorporating information about a document's speaker or author. This lets us better detect an influential speaker in a debate or a meeting, and detect when someone expresses a positive or negative opinion about a product (a movie, book, or device) online.

However, any automatic technique can make mistakes. To address this, we also developed "interactive topic models", which allow users to provide feedback in an easy-to-use fashion: when users see the result of a topic model, they can provide feedback of the form "word X does not belong in the same topic as word Y". The system folds such feedback into the model, keeping the other topics consistent while correcting the problems users noticed.

Our research helps bring sophisticated machine learning technology to ordinary users without technical backgrounds, allowing them to make sense of "big data" without needing advanced degrees in computer science. This work also contributes to applications that help people understand documents, particularly expressed opinions, across many languages and cultures, thus improving international communication and understanding.
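As a concrete illustration of the newspaper example above, the following sketch shows how an off-the-shelf topic model can induce two categories from a handful of invented "headlines" and report each document's topic mixture. It uses scikit-learn rather than the project's own system, and the documents and settings are our own illustrative assumptions; on a corpus this tiny the split will be noisy, but on a decade of news articles the same procedure recovers the kind of sports/politics structure described above:

```python
# Toy topic-model illustration: induce two categories from invented
# headlines and print each document's topic mixture. Off-the-shelf
# scikit-learn, not the project's system; all data here is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the game in overtime",
    "coach praises team after playoff game",
    "sprinters set a record in the relay race",
    "the senator won the vote on the bill",
    "parliament debates the new tax bill",
    "election race tightens as voters head to the polls",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)               # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                # per-document topic mixtures

# Top words per induced topic (ideally one "sports", one "politics").
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = ", ".join(vocab[i] for i in weights.argsort()[::-1][:4])
    print(f"topic {k}: {top}")

# Mixture proportions per document; note that "race" appears in both a
# sports and a political headline, the same ambiguity that motivates
# context-appropriate machine translation above.
for doc, mix in zip(docs, doc_topics):
    print(f"{mix.round(2)}  {doc}")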