University of Macau Library | UML Digital Resources Hub

UM Dissertations & Theses Collection (澳門大學電子學位論文庫)

Title

Multidimensional similarity join using MapReduce

English Abstract

Join is an essential and often the most costly operation in many data analysis tasks, especially on very large data sets with completely different features. MapReduce is a popular parallel framework for processing large amounts of data, but it does not support the join operation directly. We study the problem of multidimensional similarity join using MapReduce. Based on data statistics, we present a sampling-based selectivity estimation technique that optimizes multidimensional join processing by searching for the feature on which to partition the join. We then design two optimization methods, one based on data size and one based on computation cost, both using frequency statistics. Finally, we design a new skyline to find a suitable partition plan for different cluster configurations. Our work makes full use of the properties of individual features, and of the differences between features, to compute the multidimensional similarity join in a way that minimizes job completion time on MapReduce. It can also give the system administrator suggestions about the cluster that shorten execution time. None of the algorithms requires any modification of the MapReduce framework. To the best of our knowledge, this is the first study to do so.

Issue date

2015

Author

Wang, Jian

Faculty

Faculty of Science and Technology

Department

Department of Computer and Information Science

Degree

M.Sc.

Subject

MapReduce (Computer program)

Parallel programs (Computer programs)

Files In This Item

Full-text (Internet)

Location

1/F Zone C

Library URL

991000758169706306