UM E-Theses Collection (澳門大學電子學位論文庫)

Syntax-based automatic tree-to-tree alignment for statistical machine translation

English Abstract

Machine translation (MT) has become an increasingly popular research topic in natural language processing (NLP). MT can be traced back to the 1950s; the idea is to use digital computers to translate natural languages without human involvement, which is why it is also regarded as automatic translation. With growing research on MT, Data-Oriented Translation (DOT) already achieves state-of-the-art results, since well-aligned parallel tree pairs can improve MT substantially. However, aligned parallel Treebanks are rare and usually not publicly available. In recent years, with increasing computing power and rapidly growing data, how to extract, process, and fully utilize useful data has become a new challenge for the research community. Manual construction of aligned tree pairs is very time-consuming and expensive: it requires annotators trained in linguistics and proficient in two or more languages. Therefore, in search of a cheaper, faster, repeatable, and high-quality method, researchers have begun to investigate automatic tree alignment. The accuracy of current automatic tree alignment methods is still far from sufficient, and the methods have many weaknesses. More effort is needed to advance research in this area. In this thesis, we review the traditional manual alignment approaches. Although the quality of manually aligned parallel Treebanks is very high, it is infeasible to build by hand a parallel Treebank of the scale required for modern natural language processing tasks such as machine translation, which usually needs millions of sentences for training. Subsequently, some automatic alignment approaches have been proposed. Most of them are based on word alignment information, and some even require human intervention. If any part of those processes goes wrong, the overall alignment accuracy suffers.
Therefore, we need a more robust and stable model for the alignment of parallel trees. To this end, this thesis proposes a syntax-based model for generating a high-quality parallel Treebank. To deal with the different annotation tagsets used across languages, we propose a universal tagset as a single scheme shared by the languages. It covers both the Part-of-Speech tags and the syntactic categories (at the phrase level). This effectively reduces the interference caused by different annotation standards in parallel Treebank construction. The proposed model relies on minimal external resources, yet is robust and effective enough to build parallel Treebanks automatically. It is evaluated against a state-of-the-art approach reported in the literature on Chinese-English and Chinese-Portuguese pairs. Empirical results show that our model consistently gives better alignment results.
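
The universal-tagset idea can be illustrated with a minimal sketch. It is not the thesis's actual mapping or model: the two small tables, the function name `to_universal`, and the fallback label `X` are assumptions chosen for illustration. The sketch only shows how language-specific Part-of-Speech tags could be projected onto one shared scheme, so that nodes from two trees become directly comparable.

```python
# Illustrative sketch only. The tables below are small hypothetical excerpts,
# not the tagset mapping used in the thesis.
PTB_TO_UNIVERSAL = {  # English (Penn Treebank style tags)
    "NN": "NOUN", "VB": "VERB", "RB": "ADV", "IN": "ADP", "PRP": "PRON",
}
CTB_TO_UNIVERSAL = {  # Chinese (Chinese Treebank style tags)
    "NN": "NOUN", "VV": "VERB", "AD": "ADV", "P": "ADP", "PN": "PRON",
}


def to_universal(tags, table, unknown="X"):
    """Map language-specific tags to the shared tagset.

    Tags missing from the table fall back to `unknown` (an assumed label).
    """
    return [table.get(tag, unknown) for tag in tags]


# Two differently annotated sequences end up in the same label space.
english = to_universal(["PRP", "VB", "NN"], PTB_TO_UNIVERSAL)
chinese = to_universal(["PN", "VV", "NN"], CTB_TO_UNIVERSAL)
```

The same lookup approach would apply to phrase-level categories, given a corresponding table.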

Issue date



Xing, Jun Wen


Faculty of Science and Technology


Department of Computer and Information Science




Algorithms -- Data processing

Machine translating


Wong, Fai

Chao, Sam

Files In This Item

Full-text (Intranet only)
