A Scalable Hierarchical Clustering Algorithm Using Spark

Chen Jin, Ruoqian (Rosanne) Liu, Zhengzhang Chen, William Hendrix, Ankit Agrawal, Alok Choudhary

Abstract

Clustering is often an essential first step in data mining, intended to reduce redundancy or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting potential group structures. However, parallelizing such an algorithm is challenging because it exhibits inherent data dependencies during hierarchical tree construction. In this paper, we design a parallel implementation of Single-linkage Hierarchical Clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for parallelizing the single-linkage clustering algorithm because it expresses iterative processes naturally. Our algorithm can be deployed easily in Amazon's cloud environment, and a thorough performance evaluation on Amazon EC2 verifies that the algorithm continues to scale as the datasets grow.
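
To illustrate the formulation mentioned in the abstract, the following is a minimal sequential sketch of how single-linkage clustering can be read off a minimum spanning tree. It is not the paper's parallel Spark implementation. It assumes Euclidean distance and a small in-memory point list, and the function name `single_linkage_via_mst` is hypothetical. It runs Kruskal's algorithm over the complete distance graph, so each accepted edge is both an MST edge and one merge step of the dendrogram.

```python
from itertools import combinations
from math import dist


def single_linkage_via_mst(points):
    """Return single-linkage merges as (point_i, point_j, distance) tuples.

    Illustrative sketch only: sequential Kruskal over all pairwise edges,
    assuming Euclidean distance. Merges come out in increasing distance order.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest; every point starts alone

    def find(i):
        # Walk to the root of i's set, halving the path along the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every pairwise edge, sorted by distance (O(n^2) edges, small inputs only)
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    merges = []
    for d, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # edge joins two different clusters, so keep it
            parent[root_j] = root_i
            merges.append((i, j, d))
            if len(merges) == n - 1:  # a spanning tree has n - 1 edges
                break
    return merges
```

Cutting the returned list at a distance threshold gives a flat clustering: the merges below the threshold define the clusters, and the remaining ones are the links that would join them further up the hierarchy.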

Venue

In 2015 IEEE First International Conference on Big Data Computing Service and Applications.

BibTeX

@inproceedings{jin2015scalable,
  title={A scalable hierarchical clustering algorithm using spark},
  author={Jin, Chen and Liu, Ruoqian and Chen, Zhengzhang and Hendrix, William and Agrawal, Ankit and Choudhary, Alok},
  booktitle={2015 IEEE First International Conference on Big Data Computing Service and Applications},
  pages={418--426},
  year={2015},
  organization={IEEE}
}

Date

March 2015

Links

BigDataService 2015 PDF