A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation


2013, Doctor of Philosophy (PhD), Wright State University, Computer Science and Engineering PhD.
The n-gram model is the most widely used language model (LM) in statistical machine translation systems, due to its simplicity and scalability. However, it encodes only the local lexical relations between adjacent words and ignores the rich syntactic and semantic structure of natural language. Attempting to increase the order of an n-gram model to describe longer-range dependencies immediately runs into the curse of dimensionality. Although previous studies have trained higher-order n-grams on large corpora, they observed no clear improvement beyond 6-grams. Meanwhile, other LMs, such as syntactic language models and topic language models, encode long-range dependencies from different perspectives of natural language, but how to effectively combine these models to capture multiple linguistic phenomena remains an open question. This dissertation presents a study of building a large-scale distributed composite language model that seamlessly combines an n-gram model, a structured language model, and probabilistic latent semantic analysis under a directed Markov random field paradigm, simultaneously accounting for local word lexical information, mid-range sentence syntactic structure, and long-span document semantic content. The composite language model is trained by a convergent N-best list approximate EM algorithm and a follow-up EM algorithm. To improve word prediction power, the composite LM is distributed under a client-server paradigm and trained on corpora of up to a billion tokens, with its orders increased up to 5-gram and 4-headword. The large-scale distributed composite language model gives drastic perplexity reductions over n-grams and achieves significantly better translation quality, measured by the BLEU score and “readability” of translations, when applied to reranking the N-best lists from a state-of-the-art parsing-based machine translation system. Moreover, we propose an A*-search-based lattice rescoring strategy to integrate the large-scale distributed composite language model into a phrase-based machine translation system. Experiments show that A*-based lattice rescoring demonstrates the advantage of the composite language model over the n-gram model more effectively than traditional N-best list rescoring.
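
As a point of reference for the n-gram baseline the abstract critiques, the following is a minimal sketch, not taken from the dissertation, of an add-k-smoothed n-gram model with perplexity evaluation. The function names and toy corpus are illustrative assumptions; the dissertation's composite model and its EM training are far richer than this.

    import math
    from collections import defaultdict

    def train_ngram(sentences, n, k=1.0):
        """Return p(word | context) estimated from n-gram counts with add-k smoothing."""
        counts = defaultdict(int)           # n-gram counts
        context_counts = defaultdict(int)   # (n-1)-gram context counts
        vocab = set()
        for sent in sentences:
            tokens = ["<s>"] * (n - 1) + sent + ["</s>"]
            vocab.update(tokens)
            for i in range(n - 1, len(tokens)):
                context = tuple(tokens[i - n + 1:i])
                counts[context + (tokens[i],)] += 1
                context_counts[context] += 1
        v = len(vocab)

        def prob(context, word):
            # Add-k smoothing keeps unseen n-grams from getting zero mass, but
            # the number of possible contexts grows as v**(n-1): raising n soon
            # hits the curse of dimensionality the abstract describes.
            c = tuple(context)
            return (counts[c + (word,)] + k) / (context_counts[c] + k * v)

        return prob

    def perplexity(prob, sentences, n):
        """exp(-average log-probability per predicted token); lower is better."""
        log_p, m = 0.0, 0
        for sent in sentences:
            tokens = ["<s>"] * (n - 1) + sent + ["</s>"]
            for i in range(n - 1, len(tokens)):
                log_p += math.log(prob(tokens[i - n + 1:i], tokens[i]))
                m += 1
        return math.exp(-log_p / m)

    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
    model = train_ngram(corpus, n=3)
    print(perplexity(model, corpus, n=3))  # perplexity on the training corpus itself

Lowering perplexity on held-out text is the intrinsic measure behind the abstract's claim of "drastic perplexity reduction"; BLEU and readability are the extrinsic translation-quality measures.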
Shaojun Wang, Ph.D. (Advisor)
Amit Sheth, Ph.D. (Committee Member)
Keke Chen, Ph.D. (Committee Member)
Krishnaprasad Thirunarayan, Ph.D. (Committee Member)
Xinhui Zhang, Ph.D. (Committee Member)
121 p.

Recommended Citations


  • Tan, M. (2013). A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation [Doctoral dissertation, Wright State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=wright1386111950

    APA Style (7th edition)

  • Tan, Ming. A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation. 2013. Wright State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=wright1386111950.

    MLA Style (8th edition)

  • Tan, Ming. "A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation." Doctoral dissertation, Wright State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=wright1386111950

    Chicago Manual of Style (17th edition)