This Small Business Innovation Research (SBIR) Phase II project embodies an innovative approach to machine translation. The proposed model aims to overcome two important bottlenecks in the development of a high-quality statistical machine translation (SMT) system: (1) an inability to handle structural problems and (2) a dependence on huge amounts of parallel text. The inability of statistical methods to adequately handle grammatical problems such as word order becomes more evident when the two languages differ greatly in structure and morphology, as English and Korean do. The dependence on huge amounts of parallel text is a particular challenge for speech translation. Building on successful tests in the Phase I project, this project proposes a method that automatically learns from the input the linguistic knowledge crucial to handling word order and non-local dependencies, and incorporates that knowledge into SMT along with simple transformations. The method is intended to maximize the strengths of both knowledge-based and statistical approaches while minimizing the need for ever-increasing amounts of bilingual data. The proposed approach aims to build a syntactic-phrase-based statistical machine translation engine that is not only more accurate than existing word-based engines but also less dependent on large data sources.
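As a toy illustration of the kind of word-order transformation referred to above, the sketch below moves the verb phrase of an English subject-verb-object clause to the end, approximating Korean subject-object-verb order. It is not the project's method: it assumes the clause has already been split into labeled syntactic phrases, the labels S, V, and O and the function name reorder_svo_to_sov are hypothetical, and the rule is written by hand, whereas the project proposes to learn such knowledge automatically.

```python
from typing import List, Tuple

# (syntactic label, phrase text); the labels "S", "V", "O" are hypothetical
Chunk = Tuple[str, str]


def reorder_svo_to_sov(chunks: List[Chunk]) -> List[Chunk]:
    """Move verb chunks after all other chunks, preserving relative order."""
    non_verbs = [c for c in chunks if c[0] != "V"]
    verbs = [c for c in chunks if c[0] == "V"]
    return non_verbs + verbs


# A hand-labeled English clause (illustrative input, not project data)
clause = [
    ("S", "the nurse"),
    ("V", "checked"),
    ("O", "the patient's temperature"),
]

reordered = reorder_svo_to_sov(clause)
print(" ".join(text for _, text in reordered))
# prints: the nurse the patient's temperature checked
```

Reordering the source side before translation lets each phrase be translated roughly in place, which is one way a syntactic-phrase-based system might reduce the burden on the statistical model for structurally distant language pairs.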
The primary impact of the proposed project is the potential to achieve automatic translation quality as high as that of the best knowledge-based machine translation engines, but with a minimum of handcrafted knowledge and therefore at a much lower cost in development time and human resources. While the research is specifically concerned with machine translation between English and Korean, the resulting translation models would potentially be usable for translation between any pair of languages. The results of the research will be used to develop a speech translation device, in particular to overcome language barriers in communication with patients in hospitals. It will provide a key technology that accelerates the development of speech translation applications, reducing costs for healthcare providers and enhancing the quality of healthcare. Additionally, the proposed method of learning linguistic features will have an impact on many other applications, including speech recognition, search engines, genre and topic detection, and document search and query. Finally, the proposed research will have beneficial impacts nationally and globally by helping to solve the 'automatic translation' problem, an area of paramount importance to the economic welfare and security of the United States and the rest of the world.