Natural language processing, information extraction, information integration and other text processing solutions are central components of computer science, and key tools for addressing the ever-increasing problem of information overload. Information overload is not only a personal problem, but also a critical issue for business productivity, national defense, and, increasingly, government decision-making and transparency.

State-of-the-art natural language processing is increasingly based on machine learning. However, the methodologies can be complex, and the software infrastructure necessary for such systems is generally difficult to develop from scratch. To address this need we have created MALLET (MAchine Learning for LanguagE) and FACTORIE (Factor graphs, Imperative, Extensible), open-source software toolkits that run in the Java virtual machine. They provide many modern state-of-the-art machine learning methods, specially tuned to scale to the idiosyncrasies of natural language data, while also applying well to many other discrete non-language tasks.

The project will fill three critical gaps: (1) broadening these toolkits' applicability to new data and tasks (with better end-user interfaces for labeling, training and diagnostics), (2) greatly enhancing their research-support capabilities (with infrastructure for flexibly specifying model structures), and (3) improving their understandability and support (with new documentation, examples, and online community support).

The project will have a direct positive impact on NLP and other machine learning research, on teaching, and on collaborative research activities. Well-designed toolkits not only help researchers avoid duplicate implementation effort, but also (a) encourage sharing of algorithms and code, and thus cultivate increased collaboration and intellectual flow of ideas; (b) foster clear, detailed communication of algorithms and scientific reproducibility; (c) help "level the playing field" by providing state-of-the-art implementations of foundational building blocks and recent methods to top-tier and small institutions alike; and (d) supply a teaching tool by making it easy for students to experiment with the supplied research methodologies. Furthermore, the project will provide multiple ready-to-use systems, giving non-programmers access to modern, scalable implementations of text processing tools. This will spread knowledge and use of these techniques across fields, including the social sciences, humanities, and biomedical fields.

For further information see the project web site at the URL: www.cs.umass.edu/~mccallum/nsf-mallet

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0958392
Program Officer: Clifton Bingham
Project Start:
Project End:
Budget Start: 2010-06-01
Budget End: 2014-05-31
Support Year:
Fiscal Year: 2009
Total Cost: $650,000
Indirect Cost:
Name: University of Massachusetts Amherst
Department:
Type:
DUNS #:
City: Amherst
State: MA
Country: United States
Zip Code: 01003