UM E-Theses Collection (澳門大學電子學位論文庫)


Multidimensional similarity join using MapReduce

English Abstract

Join is an essential and often the most costly operation in many data analysis tasks, especially for very large data sets whose features differ widely. MapReduce is a popular parallel framework for processing large amounts of data, but it does not support the join operation directly. We study the problem of multidimensional similarity join using MapReduce. Based on data statistics, we present a sampling-based selectivity estimation technique that optimizes multidimensional join processing by searching for the feature on which to partition the join. We then design two optimization methods, one based on data size and one based on computation cost, both using frequency statistics. Finally, we design a new skyline for finding a suitable partition plan under different cluster configurations. Our work makes full use of the properties of the individual features, and of the differences between them, to compute the multidimensional similarity join in a way that minimizes job completion time on MapReduce. We can also give the system administrator suggestions about the cluster configuration to shorten execution time. None of the algorithms requires any modification of the MapReduce framework. To the best of our knowledge, this is the first study to do so.
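Since MapReduce offers no built-in join, a similarity join has to be expressed as a map step that assigns records to partitions and a reduce step that compares records within each partition. The following is a minimal in-memory sketch of that general idea, not the thesis's algorithm: it assumes a plain fixed grid with cell width eps and Euclidean distance, whereas the thesis chooses its partition plan from sampled statistics. All function names (cell_of, map_phase, shuffle, reduce_phase, similarity_join) are hypothetical, and the shuffle is simulated with a dictionary rather than a real cluster.

```python
from collections import defaultdict
from itertools import product
import math


def cell_of(point, eps):
    """Grid cell (one integer index per dimension) containing the point."""
    return tuple(math.floor(x / eps) for x in point)


def map_phase(R, S, eps):
    """Emit (cell, record) pairs.

    Each R record goes to its own cell only. Each S record is replicated
    to its own cell and all adjacent cells, so every R/S pair within eps
    meets in exactly one cell (the cell of the R record).
    """
    for rid, p in R:
        yield cell_of(p, eps), ("R", rid, p)
    for sid, p in S:
        home = cell_of(p, eps)
        for offset in product((-1, 0, 1), repeat=len(p)):
            yield tuple(h + o for h, o in zip(home, offset)), ("S", sid, p)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, eps):
    """Within each cell, compare every R record with every S record."""
    for records in groups.values():
        r_side = [(i, p) for tag, i, p in records if tag == "R"]
        s_side = [(i, p) for tag, i, p in records if tag == "S"]
        for rid, rp in r_side:
            for sid, sp in s_side:
                if math.dist(rp, sp) <= eps:
                    yield rid, sid


def similarity_join(R, S, eps):
    """All (rid, sid) pairs whose points lie within distance eps."""
    return sorted(reduce_phase(shuffle(map_phase(R, S, eps)), eps))
```

In this sketch each S record is replicated to 3^d cells for d dimensions, so the data shuffled and the work done per reducer both grow quickly with dimensionality and with skew in the data. That is the kind of data-size and computation-cost trade-off the abstract describes optimizing through the choice of partition plan.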

Issue date



Wang, Jian


Faculty of Science and Technology




MapReduce (Computer program)

Parallel programs (Computer programs)

Software Engineering -- Department of Computer and Information Science
