UM E-Theses Collection (澳門大學電子學位論文庫)


Recursive neural network based data selection for statistical machine translation

English Abstract

RECURSIVE NEURAL NETWORK BASED DATA SELECTION FOR STATISTICAL MACHINE TRANSLATION
by Lu Yi
Thesis Supervisors: Lidia S. Chao and Derek F. Wong
Master of Science in Software Engineering

Statistical machine translation (SMT) systems perform well when the text to be translated is similar to the corpus on which the system was trained. In practice, SMT systems are often trained on large, broad corpora that are collected from various sources and contain texts from multiple domains. When these systems are applied to a domain-specific translation task, their performance is suboptimal because of the mismatch between the test set and the training set. Domain adaptation is a research topic that aims to improve the translation performance of SMT systems on domain-specific tasks. Data selection is a widely used and effective solution to domain adaptation in SMT. The dominant methods are perplexity-based; they do not consider the mutual translations of sentence pairs, and they tend to select short sentences. To address these problems, this thesis proposes monolingual and bilingual semi-supervised recursive autoencoder data selection methods to differentiate domain-relevant data from out-of-domain data. The proposed methods are evaluated on the task of building domain-adapted SMT systems. We present extensive comparisons and show that the proposed methods outperform state-of-the-art data selection approaches.

Issue date



Lu, Yi


Faculty of Science and Technology


Department of Computer and Information Science




Computational linguistics -- Databases

Machine translating


Chao, Sam


1/F Zone C