The aim of this project is to build statistical language models that capture several kinds of regularities in natural language, chiefly local lexical patterns and long-range syntactic or semantic structure, in order to improve the performance of natural language applications. The work is conducted under the directed Markov random field paradigm, in which increasingly advanced syntactic structure and/or semantic topic components are sequentially embedded to form complex distributions over natural language. By exploiting the particular structure of each composite language model, seemingly complex statistical representations are decomposed into simpler ones; the estimation and inference algorithms for the simpler composite language models then serve as internal building blocks for estimating more complex composite models, ultimately solving the estimation problem for extremely complex, high-dimensional distributions.
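As a loose illustration of the composition idea (not the project's actual directed Markov random field formulation), a composite model can be sketched as combining a simple local component with a simple global component, each estimated on its own and then interpolated. The class name, method names, and the linear-interpolation scheme below are illustrative assumptions, not details taken from the project:

```python
from collections import defaultdict


class CompositeLM:
    """Toy composite language model: interpolates a local bigram
    component with a global unigram component standing in for a
    coarse "topic" signal. Purely illustrative."""

    def __init__(self, lam=0.7):
        self.lam = lam  # weight on the local (bigram) component
        self.bigram = defaultdict(lambda: defaultdict(int))
        self.unigram = defaultdict(int)
        self.total = 0

    def train(self, sentences):
        # Each simple component is estimated independently by counting;
        # the composite model only combines their predictions.
        for sent in sentences:
            tokens = ["<s>"] + sent
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigram[prev][cur] += 1
                self.unigram[cur] += 1
                self.total += 1

    def prob(self, prev, cur):
        ctx = self.bigram[prev]
        # Local estimate from the bigram component (0 if context unseen).
        p_local = ctx[cur] / sum(ctx.values()) if ctx else 0.0
        # Global estimate from the unigram component.
        p_global = self.unigram[cur] / self.total if self.total else 0.0
        return self.lam * p_local + (1 - self.lam) * p_global


lm = CompositeLM(lam=0.5)
lm.train([["the", "cat", "sat"], ["the", "dog", "sat"]])
p = lm.prob("the", "cat")
```

The point of the sketch is only the decomposition: each component is trained and queried in isolation, so richer components (syntactic, topical) could in principle be slotted in without changing how the composite prediction is formed.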
The composite language models are scalable and may significantly improve the performance of state-of-the-art speech recognition and machine translation systems, which would constitute an important contribution to language modeling research. The techniques developed in this project may not only lead to effective, robust, and intelligent language technology applications but may also be extended and applied to problems in computational biology and computer vision. The project provides an excellent environment for interdisciplinary education in information technology, bridging language and speech processing, machine learning and computational statistics, and theoretical computer science, to the benefit of students at all levels.