Efficient similarity joins on massive high-dimensional datasets using MapReduce

Wuman Luo, Haoyu Tan, Huajian Mao, Lionel M. Ni

Research output: Conference contribution (peer-reviewed)

25 Citations (Scopus)

Abstract

High-dimensional similarity join (HDSJ) is critical for many novel applications in the domain of mobile data management. Performing HDSJs efficiently now faces two challenges. First, the scale of datasets is increasing rapidly, making parallel computing on a scalable platform a must. Second, the dimensionality of the data can reach hundreds or even thousands, which gives rise to the curse of dimensionality. In this paper, we address these challenges and study how to perform parallel HDSJs efficiently in the MapReduce paradigm. In particular, we propose a cost model to demonstrate that it is important to take both communication and computation costs into account as dimensionality and data volume increase. To this end, we propose DAA (Dimension Aggregation Approximation), an efficient compression approach that can significantly reduce both of these costs when performing parallel HDSJs. Moreover, we design DAA-based parallel HDSJ algorithms that can scale up to massive data sizes and very high dimensionality. We perform extensive experiments using both synthetic and real datasets to evaluate the speedup and the scale-up of our algorithms.

Original language: English
Title of host publication: Proceedings - 2012 IEEE 13th International Conference on Mobile Data Management, MDM 2012
Publisher: IEEE Computer Society
Pages: 1-10
Number of pages: 10
ISBN (Print): 9780769547138
DOIs
Publication status: Published - 2012
Externally published
Event: 2012 IEEE 13th International Conference on Mobile Data Management, MDM 2012 - Bengaluru, Karnataka, India
Duration: 23 Jul 2012 to 26 Jul 2012

Publication series

Name: Proceedings - 2012 IEEE 13th International Conference on Mobile Data Management, MDM 2012

Conference

Conference: 2012 IEEE 13th International Conference on Mobile Data Management, MDM 2012
Country/Territory: India
City: Bengaluru, Karnataka
Period: 23/07/12 to 26/07/12
