University of Macau Library

UM E-Theses Collection (澳門大學電子學位論文庫)

Title

Word representation and its applications in natural language processing

English Abstract

In machine learning, two main factors affect the performance of a classifier: the learning approach and the features. Word representation is a powerful new kind of feature that has been widely used in recent years. It represents a word as either a binary code or a numerical vector. Its basic function is to distinguish a word from other words; beyond that, we want the representation to carry as much task-related information as possible. In natural language processing, word representations are typically endowed with semantic, grammatical or syntactic information. Many natural language processing tasks rely on well-defined features, such as part-of-speech tags, named entity tags and Chinese word segmentation tags, so many scholars focus on task-specific feature engineering. Traditional features are mainly designed by human experts, which leads to two main drawbacks: 1. Human-designed features are relatively difficult to reuse in other tasks and languages. 2. The features are usually defined by linguists or engineers and validated experimentally; to acquire them, the training materials must first be annotated manually, which is extremely time-consuming. Word representation was introduced to overcome these drawbacks. This thesis addresses the feature engineering challenge in natural language processing, with emphasis on automatically extracted features. It first introduces the three types of word representation, and then two types of natural language processing tasks: sequence labelling and text classification. To distinguish Western languages from Asian languages, the thesis further divides these two types into three tasks: English sequence labelling, Chinese sequence labelling and Chinese text classification.
These three tasks cover both Chinese and English, and both sequence labelling and text classification. In summary, this thesis makes the following contributions: (1) We apply a multilayer neural network to show that human-defined features can be replaced by word representation. With word representation, we can approximate the performance of the traditional approach and even that of the state-of-the-art approach. (2) We propose a clustering procedure to take advantage of word representation in the Conditional Random Fields model. (3) We achieve a state-of-the-art result in Chinese sentiment analysis using a recursive autoencoder. Most notably, this result was obtained without any human effort, that is, without human-defined features such as a polarity dictionary. This is the first time that a recursive autoencoder has been applied to Chinese sentiment analysis.

Issue date

2014.

Author

Zong, Hao

Faculty

Faculty of Science and Technology

Department

Department of Computer and Information Science

Degree

M.Sc.

Subject

Natural language processing (Computer science)

Chinese language

Supervisor

Wong, Fai

Files In This Item

Full-text (Intranet only)

Location

1/F Zone C

Library URL

991008658459706306