UM E-Theses Collection (澳門大學電子學位論文庫)


Statistical machine translation for typologically-different languages

English Abstract

Morphological differences and word reordering between the languages of a pair are two of the main difficulties in statistical machine translation (SMT). Tokens in a morphologically rich language carry multiple grammatical senses. For instance, a Korean verb form combines a verb stem with final endings. The state-of-the-art SMT model, the phrase-based model, translates phrases as units and in most cases handles short-range word order better than the traditional word-based model. However, for language pairs with very different word order (long-range reordering), the original phrase-based model is weak at reordering the words of the two languages. Many methods have been proposed for these problems, but they share common ground: many rules are needed to analyze and decompose morphologically rich words, and reordering rules must be extracted from the training corpora, manually or automatically, to improve translation performance. In fact, both problems affect the automatic word alignment result in both the training and decoding steps of an SMT system. In the training step, if a verb final ending is aligned with many corresponding words, or some tokens are aligned with discontinuous words, the extracted phrase pairs will be quite different. In the decoding step, the decoder will search for different words if the word alignment probabilities in the trained model are different. This thesis proposes a morphological analysis-based sentence restructuring method and an alignment-based word reordering method. Besides these two methods, a POS factored translation model and two hybrid models are applied in the experiments. POS and morphological information is introduced to resolve the two problems.
One contribution of this research is to use the output of a morphological analyzer for morphologically rich languages to separate the lemmas of inflected words from their suffixes. The second contribution is a language-independent word reordering method that monotonically adjusts the word sequence of parallel sentences without additional resources such as a morphological analysis tool or a syntactic parser. The third contribution is the design of two hybrid translation models that integrate the proposed methods. The experimental results show that the hybrid models combine the advantages of the proposed methods and give the best translation results.
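
The abstract does not describe the reordering procedure in detail. As a rough illustration only, the general idea of alignment-based reordering can be sketched as follows: given word alignment links, source words are rearranged so that they follow the order of the target words they align to. The function name `reorder_source`, the averaging of target positions, and the handling of unaligned words are assumptions made for this sketch, not the method of the thesis.

```python
from typing import Dict, List, Tuple


def reorder_source(src: List[str], alignment: List[Tuple[int, int]]) -> List[str]:
    """Reorder source tokens to follow target-side order (illustrative sketch).

    `alignment` holds (source_index, target_index) links. Both the averaging
    rule and the treatment of unaligned words are assumptions of this sketch.
    """
    # Collect the target positions linked to each source position.
    targets: Dict[int, List[int]] = {i: [] for i in range(len(src))}
    for s, t in alignment:
        targets[s].append(t)

    keys: List[float] = []
    last_key = -1.0  # an unaligned word inherits the key of the preceding word
    for i in range(len(src)):
        if targets[i]:
            # Use the mean target position as the sort key.
            last_key = sum(targets[i]) / len(targets[i])
        keys.append(last_key)

    # sorted() is stable, so words with equal keys keep their original order.
    order = sorted(range(len(src)), key=lambda i: keys[i])
    return [src[i] for i in order]


# Verb-final source order (English glosses) aligned to the target "I eat apple".
print(reorder_source(["I", "apple", "eat"], [(0, 0), (1, 2), (2, 1)]))
```

After this kind of reordering the alignment links no longer cross, which is one way to read "monotonically adjust" in the abstract. The sketch needs only the alignment links themselves, with no morphological analyzer or parser.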

Issue date



Li, Shuo


Faculty of Science and Technology


Department of Computer and Information Science




Machine translating


Wong, Fai

Chao, Sam
