Fast clustering of big data through successive bi-partitioning

Journal Title
Journal ISSN
Volume Title

Data analysis plays an indispensable role in understanding various phenomena and discovering useful information. Clustering, one of the most used data mining techniques, is an unsupervised, iterative machine learning method used for statistical data analysis. A clustering process groups data into clusters with high similarity for data within each cluster but large dissimilarity between different clusters. The K-means algorithm and its variants are probably the most successful clustering methods for numerical data.

For categorical data, the K-modes clustering algorithm is deemed the method of choice as it possesses high algorithmic simplicity like K-means. However, it suffers from unreliable performances in clustering quality and efficiency, both heavily influenced by the choice of initial cluster centers. Random selection of initial centers is easy but may lead to poor clustering quality and large numbers of iterations. Thus, it is important to find method for selecting K-modes' initial cluster centers, which is efficient and produces high clustering quality.

For the first part of this work, we introduce a variant of the K-modes algorithm called Bisecting K-modes (BK-modes), which, by solving the initialization issue of the K-modes, improves both clustering quality and efficiency. The BK-modes works by splitting a dataset into multiple clusters iteratively with one cluster being chosen and bisected into two clusters in each iteration. The cluster that has the highest sum of distances of the data points to their cluster center is chosen to be bisected at each iteration. Once K clusters are produced, the centers of these K clusters are used as the initial K cluster centers for the K-modes algorithm. Experimental studies of the BK-Modes on using big categorical datasets were carried out and compared against the K-modes with multiple sets of random initial centers as well as the state-of-the-art method we found in our survey. The results indicated that the BK-modes is efficient and produces high clustering quality for all cases tested in the experiments, demonstrating BK-modes a high-performance method for clustering large categorical datasets.

In the second part of this work, we study the optimization process of the K-means algorithm after it has been passed through the bisecting process. The bisecting K-means and its variants were developed in an attempt to choose initial centers for the K-means to achieve consistently high-quality clusters. We observed that each bisection of a cluster during the successive bisecting process and in the K-means iterations after the bisecting process, possible reduction of iterations is possible while maintaining clustering quality. This motivation led us to devise a technique to limit the number of iterations. We implemented this iteration reduction technique and carried out experimental studies on five big datasets and observed that the proposed technique produces almost the same clustering quality as the traditional bisecting K-means while substantially improves clustering efficiency, rendering the proposed algorithm a high-performance method for clustering large numerical datasets.

Embargo status: Restricted until 09/2172. To request the author grant access, click on the PDF link to the left.

Clustering, K-Means, K-Modes, Big Data