The authors declare that they have no conflict of interest.
Spatial data mining (SDM) is searching important relationships and characteristics that can clearly exist in spatial databases. This content aims to compare object clustering algorithms for spatial data mining, before identifying the most efficient algorithm. To this end, this paper compare k-means, Partionning Around Medoids (PAM) and Clustering Large Applications based on RANdomized Search (CLARANS) algorithms based on computing time. Experimental results indicate that, CLARANS is very efficient and effective.
Spatial data mining is the discovery of interesting characteristics and patterns that may exist in large spatial databases. It can be used in many applications such as seismology, minefield detection and astronomy. Clustering, in spatial data mining, is a useful technique for grouping a set of objects into classes or clusters such that objects within a cluster have high similarity among each other, but are dissimilar to objects in other clusters. In the last few years , many effective and scalable clustering methods have been developed. These methods can be categorized into partitioning methods, hierarchal methods, density based methods, grid-based methods, model-based methods and constrained-based methods
mining has several objectives, it has an important role in:
Drawing interesting spatial features and patterns
Capturing the intrinsic relationships between spatial and non spatial data,
Presenting data compilance concisely and at higher conceptual levels, and helping reorganized spatial databases to accommodate data semantics, as well as to achieve better good results.
Data analysis used cluster analysis, which generates a set of data elements into groups (or clusters) so in the same group the elements are similar to each other and different from those in other groups
Methods of clustering can be broadly classified into groups, there are two clustering groups: partitioning clustering and hierarchical clustering. The first group presents several methods, such as K-means and self- organizing map (SOM)
In this paper, we are working on a comparative analysis of k-mean, Partionning Aroud Medoids (PAM) and Clustering Large Applications based on RANdomized Search (CLARANS) algorithm and for that, we are using Flame and Spiral datasets.
The paper is organized as follows: section 2 introduces clustering Algorithms Based On Partitioning. Section 3 present the CLARANS algorithm. Experimental results are presented in section 4. Section 5 concludes the paper with a discussion on ongoing works.
There are two basic types of clustering algorithms
The K-Means algorithm
K-means algorithm is as follows:
Input | |
n: total number of clusters | |
D: Data set | |
Output: n clusters. | |
a | for initial cluster center randomly choose n objects from data set D; |
b | do; |
c | based on the mean value of the objects in the cluster, (re) assign each similar object to the cluster; |
d | update each cluster means by calculating the mean value of objects for each cluster; |
e | until no change found; |
PAM is similar to K-means algorithm. Like k- means algorithm, PAM divides data sets in to groups but based on mediods whereas k-means is based on centroids. By using mediods we can reduce the dissimilarity of objects within a cluster. In PAM, first calculate the mediod, then assigned the object to the nearest mediod, which forms a cluster.
The basic idea of PAM algorithm is choosing an initial representative object (center) for each cluster at random. The remaining objects are assigned to the nearest cluster
In the remainder, we use:
Im: current medoid that is to be replaced ,
Ip :the new medoid to replace im,
Ij: other nonmedoid objects that may or may not need to be moved
Ij,2 : current medoid that is nearest to Ij.
Now, to formalize the effect of a swap between Im and Ip, PAM computes costs Ci for all nonmedoid objects Ij.
This equation always gives a nonnegative Ci, indicating that there is a nonnegative cost incurred in replacing Im with Ip.
Unlike in (1),Ci here can be positive or negative, depending on wether Ij is more similar to Im or to Ip
and is always negative. Combining the four cases above, the total cost of replacing Im with Ip is given by:
We now present PAM algorithm
1 | Select k representative objects arbitrarily. |
2 | Compute |
3 | Select the pair Im,Ip which corresponds to min(Im, Ip) |
4 | Otherwise, for each nonselected object, find the most similar representative object. Halt |
The experimental study show that PAM is efficient for small data sets, (e.g., 100 objects in 5 clusters) but is not sufficient to process medium and large data sets. For this reason a complexity analysis on PAM is necessary. This analysis motivates the development of CLARANS.
In this section we will show our clustering algorithm CLARANS. Compared to the revealed clustering methods, CLARANS is very effective and efficient (experimental results). CLARANS is a variant of PAM
Algorithm CLARANS
Enter numlocal and maxneighbor as input parameters. Initialize i to 1, and mincost to a large number
Set current to an arbitrary node in Gn,k .
Set j to 1 .
Random neighbor S of the current, and based on equation 5, calculate the cost differential of the two nodes. If S has a lower cost, set current to S, and go to step 3.
Otherwise, increment j by 1. If j maxneighbor, go to step 4.
Otherwise, when j > maxneighbor, compare the results if the first is lower than mincost, apply mincost on the cost of the current and apply bestnode on current
Increment i by 1. If i > numlocal, output bestnode and halt. Otherwise, go to step 2.
In order to evaluate CLARANS in practice, we compare its performance with that of different k-medoids clustering techniques, using the dataset. The three algorithms were implemented using Python programmining language on PC, Intel Core i5 CPU (2.40 GHz) with 8GB RAM, Windows 10.
We start by implementing the k-means algorithm based on data distribution models using this algorithm, we want to distribute the data to a precise number of clusters. To achieve this task, we have chosen a dataset (of 240 data). In
For the PAM algorithm, we use the same dataset from the previous part of k-means (240 data).
The CLARANS algorithm consists in finding the most representative objects in each of the clusters for which these objects cost (for example, the sum of the distance to others belonging to the same cluster) is minimal. These objects are called medoids. After selecting the medoids, each of the grouped objects goes to the cluster whose representative is the medoid closest to that object. CLARANS is based on a random search for medoid candidates, a cost calculation and a comparison with the current best local solution.
This action is repeated until the number of randomly selected objects with a cost greater than the current one the lowest local cost will not exceed ‘ neighbors ‘. Then the cost of the best local solution is compared to the best global solution achieved to date and if it is smaller then local solution becomes global. Whether the local solution has become global or not, the counter is incremented by 1, and the algorithm is run again until the number of such passes reaches ‘local minimum’ values. Then, the current best overall solution is returned to run this algorithm :
Number of clusters in the data is 3
Number of objects (points / polygons) to generate, the value is 240.
Number of clusters / medoids to search, the default is the value of the numlocal parameter, the value is 3.
Maximal number of neighbors in claster 15
Points in two-dimensional space with x and y coordinates implemented as a class with x and y attributes were adopted as a data model. The second spatial data model used for analysis are polygons also defined in two- dimensional space, implemented as a class with the vertices attribute being a list containing the point objects belonging to the polygon.
The cost has been implemented:
for points as the sum of the distances between all the points in the individual clusters and the medoids in these clusters.
for polygons, as the sum of the smallest distance between the vertices of all polygons in the individual groups and the medoid polygons in these groups.
Execution time has been measured by the required time for forming clusters by the clustering algorithms k- means, PAM and CLARANS. Time required for each algorithm has been recorded and results have been drawn. Computing time (in seconds) are shown in
Dataset | PAM | Kmeans | CLARANS |
Flame (240) | 13.8 | 0.5 | 0.21 |
Spiral (312) | 56.4 | 1.69 | 0.36 |
Dataset | PAM | Kmeans | CLARANS |
Flame (240) | 363.63 | 3.34 | 0.6 |
Spiral (312) | 304.53 | 6.11 | 0.68 |
Comparing computing time of the clustering algorithms hows that the CLARANS algorithm takes less time than others, that means this work indicates CLARANS performance is better than others according to time chart.
All the proposed algorithms has been implemented using programming language python. From the experimental results, the graphical representation shows that CLARANS has a low execution time value compared to other algorithms for (k = 3 (0.21; 0.36)), (k = 10 (0.60; 0.68)) even if we have to change the value of k, and also the size of dataset but PAM has a longer execution time.
for the execution time of Kmeans has an intermediate value between the two algorithms.
the number of k clusters has an influence on the execution time.
the size of dataset has an effect on the variation of the execution time. Clarans is the best clustering algorithm compared to PAM and K-MEANS.
Clustering is an unsupervised method aimed at grouping and creating a collection of similar objects within the same group . . This work is about algorithms of clustering and their effectiveness for spatial data mining. The experimental result indicates that CLARANS algorithm reduces the execution time and gives better clusters when compared with other algorithms. But it’s not always the case. CLARANS doesn’t give better cluster. So future research work will be first focused on developing an algorithm, which will increase the performance of segmentation process. Further work also lies in this area. A result line can be drawn from this experimental work and is it is that CLARANS algorithm has better efficiency than PAM and K-MEANS algorithm.