This Small Business Innovation Research (SBIR) Phase I project proposes an innovative approach to machine translation. The project model aims to overcome two important bottlenecks in the development of a high-quality Statistical Machine Translation (SMT) system: (1) the inability to handle structural problems, and (2) dependence on huge amounts of parallel text. The inability of purely statistical methods to handle grammatical problems such as word order becomes more evident when the two languages in a pair differ greatly in structure and morphology, as English and Korean do. This project proposes a method to automatically learn from text the linguistic knowledge crucial to handling word order and nonlocal dependencies, and to incorporate that knowledge into SMT along with simple transformations. The method combines the strengths of knowledge-based and statistical approaches while minimizing the need for ever-increasing amounts of bilingual data. The approach aims to build a syntactic-phrase-based SMT engine that is not only more accurate than existing word-based engines but also less dependent on large data sources. The primary impact of the project is the potential to achieve automatic translation quality as high as that of the best knowledge-based machine translation engines while requiring a minimum of handcrafted knowledge, and therefore at much lower cost in development time and human resources.

While the research is specifically concerned with machine translation (MT) between English and Korean, the resulting translation models could potentially be used for translation between any pair of languages. In addition to benefiting machine translation research and applications directly, the research will make significant progress towards building bilingual phrase lexicons from data, which in turn will aid multilingual tasks such as cross-lingual information retrieval. Sehda's syntactic-phrase-based MT engine can produce unambiguous phrase translations, useful for indexing foreign-language documents and constructing keyword lists for document summarization. Additionally, the project's method for learning features that augment traditional language modeling will have an impact on many applications, including speech recognition, search engines, genre and topic detection, and document search and query. Lastly, this research has beneficial impacts nationally and globally by helping to solve the "automatic translation" problem, an area of paramount importance to the economic welfare and security of the US, as well as to the rest of the world.

Project Start:
Project End:
Budget Start: 2005-01-01
Budget End: 2005-06-30
Support Year:
Fiscal Year: 2004
Total Cost: $100,000
Indirect Cost:
Name: Fluential, Inc.
Department:
Type:
DUNS #:
City: Sunnyvale
State: CA
Country: United States
Zip Code: 94089