This SBIR Phase I research project by MetaCarta proposes a novel annotation technique, parallel bootstrapping, that takes advantage of existing data sets to create high-quality training material for toponym extraction and resolution. Information Retrieval (IR) systems that can handle Arabic already exist, but they perform no Geographic Information Retrieval (GIR). As the experience of MetaCarta's users shows, it is practically impossible to retrofit standard keyword-based IR systems to perform GIR at a high level, so the only way to achieve Arabic GIR capability is to start with a GIR system. The availability of a high-quality English GIR system makes it possible to address the greatest bottleneck of machine learning projects, the lack of manually truthed training data, with an innovative parallel bootstrap technique. Much of disambiguation, and more generally the extraction of semantic content from text, is still performed by rule-based systems that summarize expert knowledge of a domain. In contrast, MetaCarta employs machine-learning techniques that combine Hidden Markov and Maximum Entropy methods. For Arabic, we propose to restrict the rule-based component to morphological analysis, with later stages, in particular the extraction and disambiguation of toponyms, performed by systems trained on truthed Arabic text. Plain (untruthed) Arabic text is now available in large quantities (see in particular the Arabic Gigaword corpus produced by the Linguistic Data Consortium, LDC), but the amount of tagged material is considerably smaller, and the detailed truth values required for toponym extraction and disambiguation are extremely labor-intensive to create by manual annotation. MetaCarta will use the LDC 2004T17 and T18 parallel corpora as input, running the English side through the existing MetaCarta system to produce in-depth toponym annotation and then projecting this annotation back onto the Arabic side.
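
The projection step described above can be illustrated with a minimal sketch. It assumes that each sentence pair already has a token-level word alignment and that the English side has been tagged with toponym spans; the abstract does not say how the alignment is obtained or how MetaCarta's system represents its output, so the function name `project_spans`, the span format, the example sentence, and the hand-written alignment are all hypothetical.

```python
from typing import List, Tuple

# A span is (start_token, end_token_exclusive, label). This format is an
# assumption made for illustration, not MetaCarta's actual representation.
Span = Tuple[int, int, str]


def project_spans(source_spans: List[Span],
                  alignment: List[Tuple[int, int]]) -> List[Span]:
    """Project labeled token spans from the source sentence onto the target.

    `alignment` holds (source_index, target_index) pairs. Each source span is
    mapped to the smallest contiguous target span that covers every target
    token aligned to it; spans with no aligned tokens are dropped.
    """
    projected = []
    for start, end, label in source_spans:
        targets = sorted({t for s, t in alignment if start <= s < end})
        if not targets:
            continue  # nothing to project for this span
        projected.append((targets[0], targets[-1] + 1, label))
    return projected


# Hypothetical sentence pair with a hand-written alignment.
english = ["The", "delegation", "arrived", "in", "Baghdad"]
arabic = ["وصل", "الوفد", "إلى", "بغداد"]
alignment = [(0, 1), (1, 1), (2, 0), (3, 2), (4, 3)]

# Toponym span as an English tagger might produce it (token 4, "Baghdad").
english_spans = [(4, 5, "TOPONYM")]

arabic_spans = project_spans(english_spans, alignment)
for start, end, label in arabic_spans:
    print(label, " ".join(arabic[start:end]))
```

Taking the covering span keeps projected annotations contiguous even when the alignment is noisy, at the cost of occasionally including unrelated tokens; a production system would likely need to filter or repair such cases.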

This technology has broad appeal to customers interested in extending GIR to Arabic documents. Representative customers are highly interested in activities restricted to narrow geographic confines, and many of the documents providing information about Middle Eastern areas of key strategic importance are available only in Arabic. Deploying Arabic GIR would also enable analysts to focus more rapidly on the relevant documents.

Project Start:
Project End:
Budget Start: 2006-07-01
Budget End: 2006-12-31
Support Year:
Fiscal Year: 2006
Total Cost: $99,900
Indirect Cost:
Name: Metacarta Inc
Department:
Type:
DUNS #:
City: Cambridge
State: MA
Country: United States
Zip Code: 02139