TY - JOUR
T1 - MR-DBSCAN
T2 - A scalable MapReduce-based DBSCAN algorithm for heavily skewed data
AU - He, Yaobin
AU - Tan, Haoyu
AU - Luo, Wuman
AU - Feng, Shengzhong
AU - Fan, Jianping
N1 - Funding Information:
Acknowledgements This research was partly supported by China National Science and Technology Pillar Program(2012BAH07B01), Knowledge Innovation Project of the Chinese Academy of Sciences (KGCX2-YW-131), and Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06010500). We appreciate the reviewers and editor for their valuable suggestions and comments to improve this work.
PY - 2014/2
Y1 - 2014/2
N2 - DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. Besides, We conduct our evaluation using real large datasets with up to 1.2 billion points. The experiment results well confirm the efficiency and scalability of MR-DBSCAN.
AB - DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. Besides, We conduct our evaluation using real large datasets with up to 1.2 billion points. The experiment results well confirm the efficiency and scalability of MR-DBSCAN.
KW - data clustering
KW - data mining
KW - load balancing
KW - parallel algorithm
UR - http://www.scopus.com/inward/record.url?scp=84892438289&partnerID=8YFLogxK
U2 - 10.1007/s11704-013-3158-3
DO - 10.1007/s11704-013-3158-3
M3 - Article
AN - SCOPUS:84892438289
SN - 2095-2228
VL - 8
SP - 83
EP - 99
JO - Frontiers of Computer Science
JF - Frontiers of Computer Science
IS - 1
ER -