University of Macau Library | UML Digital Resources Hub

UM Dissertations & Theses Collection (澳門大學電子學位論文庫)

Title

Graph-based data selection for statistical machine translation

English Abstract

State-of-the-art phrase-based machine translation (MT) uses data-driven methods to build translation systems, so in theory a system should improve as the training data grows. However, the performance of a statistical machine translation (SMT) system depends not only on the quantity of the training corpus but also on its quality. In practice, the translation model is often built from a large, multi-domain corpus, and its performance is not optimal on a domain-specific translation task: an SMT system performs well only when the text to be translated is similar to the training corpus. Domain adaptation has been proposed to solve this problem; it aims to improve the translation performance of SMT systems on domain-specific tasks. Data selection is an active sub-topic of domain adaptation, and language model scores are widely used in data selection methods. However, these methods do not take commonality into consideration: a sentence can be assigned to only one domain. This thesis addresses that problem by proposing a novel graph-based label propagation method for data selection. First, every sentence is represented as a vector using cross-entropy score information, and an undirected k-dimensional tree is built from all the vectors and their similarities, where similarity is calculated from the Euclidean distance. Each node is connected to only its 25 nearest nodes in the tree, because connecting more nodes would make the later label propagation step considerably more time-consuming. Second, the nodes that come from the in-domain and out-of-domain data are given the labels IN and OUT, and Adsorption is used to propagate the labels from the labeled data to the unlabeled data. All sentences are then sorted by their IN label scores. Finally, the training data is selected from the sorted sentences.
The proposed method takes commonality into consideration. It is evaluated on a multi-domain corpus, and comparison experiments indicate that it outperforms state-of-the-art data selection approaches.
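
The pipeline described in the abstract (score vectors, a nearest-neighbour graph, label propagation, ranking by IN score) can be illustrated with a small sketch. This is not the thesis implementation. It makes several assumptions that do not come from the record: the two-dimensional toy vectors stand in for cross-entropy scores, a brute-force neighbour search replaces the k-dimensional tree, the edge weight 1 / (1 + distance) is an assumed similarity, and a simple clamped label propagation is used in place of Adsorption. All function names are hypothetical.

```python
import math


def knn_graph(vectors, k):
    """Build an undirected graph linking each node to its k nearest
    neighbours by Euclidean distance (brute force, not a k-d tree)."""
    n = len(vectors)
    edges = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted(
            (math.dist(vectors[i], vectors[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            w = 1.0 / (1.0 + d)  # assumed similarity: closer means heavier
            edges[i][j] = w
            edges[j][i] = w      # keep the graph undirected
    return edges


def propagate(edges, seeds, iterations=50):
    """Simplified label propagation (a stand-in for Adsorption).
    seeds maps node -> 1.0 for IN or 0.0 for OUT; seed scores stay fixed.
    Returns an IN score in [0, 1] for every node."""
    scores = {i: seeds.get(i, 0.5) for i in edges}
    for _ in range(iterations):
        updated = {}
        for i, nbrs in edges.items():
            if i in seeds:
                updated[i] = seeds[i]
            else:
                total = sum(nbrs.values())
                updated[i] = sum(w * scores[j] for j, w in nbrs.items()) / total
        scores = updated
    return scores


def rank_unlabeled(vectors, seeds, k=25):
    """Rank unlabeled nodes by IN score, highest first."""
    scores = propagate(knn_graph(vectors, k), seeds)
    candidates = [i for i in scores if i not in seeds]
    return sorted(candidates, key=lambda i: scores[i], reverse=True), scores


# Toy data: (in-domain cross-entropy, out-of-domain cross-entropy), made up.
vectors = [
    (1.0, 5.0),  # 0: labeled IN
    (5.0, 1.0),  # 1: labeled OUT
    (1.2, 4.8),  # 2: unlabeled, close to the IN seed
    (4.9, 1.1),  # 3: unlabeled, close to the OUT seed
    (3.0, 3.0),  # 4: unlabeled, between the two domains
]
seeds = {0: 1.0, 1: 0.0}
ranked, scores = rank_unlabeled(vectors, seeds, k=2)
print(ranked)
```

In this toy run each unlabeled sentence receives a graded IN score rather than a hard domain assignment, which is one way to read the abstract's point about commonality: a sentence lying between the two domains ends up with an intermediate score. The abstract's setting of 25 neighbours corresponds to the default `k=25`; the example uses `k=2` only because it has five nodes.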

Issue date

2016

Author

Wang, Yi Ming

Faculty

Faculty of Science and Technology

Department

Department of Computer and Information Science

Degree

M.Sc.

Subject

Machine translating

Graph labelings

Graph theory

Supervisor

Chao, Sam

Wong, Fai

Files In This Item

Full-text (Intranet only)

Location

1/F Zone C

Library URL

991005802629706306