Functionality of the ClusterR package

Lampros Mouselimis

2025-10-18

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is the main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

The most prominent examples of clustering algorithms are Connectivity-based clustering (hierarchical clustering), Centroid-based clustering (k-means, k-medoids, ...), Distribution-based clustering (Gaussian mixture models) and Density-based clustering (DBSCAN, OPTICS, ...).

The ClusterR package consists of centroid-based (k-means, mini-batch-kmeans, k-medoids) and distribution-based (GMM) clustering algorithms. Furthermore, the package offers functions to validate the output clusters against true labels, to estimate the optimal number of clusters, to predict new data and to plot the results.

The following notes and examples explain the functionality of the clustering algorithms, which are part of the ClusterR package.

Gaussian Mixture Models (GMM)

Gaussian Mixture Models are a probabilistic model for representing normally distributed subpopulations within an overall population. A Gaussian mixture model is parameterized by two types of values, the mixture component weights and the component means and covariances (for the multivariate case). If the number of components is known, expectation maximization is the technique most commonly used to estimate the mixture model's parameters.

The GMM function in the ClusterR package is an R implementation of the Armadillo library class for modeling data as a Gaussian Mixture Model (GMM), under the assumption of diagonal covariance matrices. A number of function parameters can be tuned, among others the gaussian_comps, the dist_mode (eucl_dist, maha_dist), the seed_mode (static_subset, random_subset, static_spread, random_spread), the km_iter and the em_iter (more information about the parameters can be found in the package documentation). I'll illustrate the GMM function using the synthetic data dietary_survey_IBS,

library(ClusterR)

data(dietary_survey_IBS)

dim(dietary_survey_IBS)
## [1] 400  43
X = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]   # data (excluding the response variable)

y = dietary_survey_IBS[, ncol(dietary_survey_IBS)]    # the response variable

dat = center_scale(X, mean_center = T, sd_scale = T)  # centering and scaling the data
gmm = GMM(dat, 2, dist_mode = "maha_dist", seed_mode = "random_subset",
          km_iter = 10, em_iter = 10, verbose = F)

# predict centroids, covariance matrix and weights
pr = predict(gmm, newdata = dat)


The GMM function, initially, returns the centroids, the covariance matrix (where each row of the matrix represents a diagonal covariance matrix), the weights and the log-likelihoods for each gaussian component. Then, the predict function takes the GMM model and returns the most probable clusters.
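For example, the fitted model and the predictions can be inspected directly (a minimal sketch; the component names below - 'centroids', 'covariance_matrices', 'weights', 'Log_likelihood' and 'cluster_labels' - are assumed to match the return structure described above):

gmm$centroids              # one row per gaussian component
gmm$covariance_matrices    # each row holds the diagonal of a component's covariance matrix
gmm$weights                # mixture component weights
dim(gmm$Log_likelihood)    # log-likelihoods, one column per component
head(pr$cluster_labels)    # most probable cluster per observation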

In addition to the previously mentioned functions, the Optimal_Clusters_GMM function can be utilized to estimate the number of clusters of the data using either the AIC (Akaike information) or the BIC (Bayesian information) criterion,

opt_gmm = Optimal_Clusters_GMM(dat, max_clusters = 10, criterion = "BIC",
                               dist_mode = "maha_dist", seed_mode = "random_subset",
                               km_iter = 10, em_iter = 10, var_floor = 1e-10,
                               plot_data = T)



When selecting among a specific number of models, the model with the lowest BIC should be preferred; here, this is the model with 2 clusters.
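A quick way to read off the suggested number of components is to pick the position of the minimum criterion value (assuming opt_gmm is the plain numeric vector of BIC values returned above):

which.min(opt_gmm)         # number of clusters with the lowest BIC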

Assuming that true labels are available, one can use the external_validation methods (rand_index, adjusted_rand_index, jaccard_index, fowlkes_Mallows_index, mirkin_metric, purity, entropy, nmi (normalized mutual information) and var_info (variation of information)) to validate the output clusters,

res = external_validation(dietary_survey_IBS$class, pr$cluster_labels,
                          method = "adjusted_rand_index", summary_stats = T)

res

## 
## ----------------------------------------
## purity                         : 1
## entropy                        : 0
## normalized mutual information  : 1
## variation of information       : 0
## ----------------------------------------
## specificity                    : 1
## sensitivity                    : 1
## precision                      : 1
## recall                         : 1
## F-measure                      : 1
## ----------------------------------------
## accuracy OR rand-index         : 1
## adjusted-rand-index            : 1
## jaccard-index                  : 1
## fowlkes-mallows-index          : 1
## mirkin-metric                  : 0
## ----------------------------------------


and if the summary_stats parameter is set to TRUE then it also returns the specificity, sensitivity, precision, recall and F-measure metrics.

k-means

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. The most common algorithm uses an iterative refinement technique. Due to its ubiquity, it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community.
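The iterative refinement mentioned above can be sketched in a few lines of base R (a toy illustration of Lloyd's algorithm, not the implementation used in the package):

lloyd_kmeans = function(X, k, iters = 10) {
  X = as.matrix(X)
  centers = X[sample(nrow(X), k), , drop = FALSE]          # random initial centroids
  for (i in 1:iters) {
    # assignment step: each observation goes to its nearest centroid
    d = as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cl = max.col(-d)
    # update step: each centroid becomes the mean of its assigned observations
    for (j in 1:k) {
      if (any(cl == j)) centers[j, ] = colMeans(X[cl == j, , drop = FALSE])
    }
  }
  list(centroids = centers, clusters = cl)
}

set.seed(1)
toy = matrix(rnorm(200), ncol = 2)                         # 100 observations, 2 dimensions
res = lloyd_kmeans(toy, k = 2)
table(res$clusters)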

The ClusterR package provides two different k-means functions, the KMeans_arma, which is an R implementation of the k-means armadillo library, and the KMeans_rcpp, which uses the RcppArmadillo package. Both functions lead to the same output results, however, they return different features, which I'll explain in the next code chunks.

KMeans_arma

The KMeans_arma function is faster than the KMeans_rcpp function, however, it initially outputs only the centroids for a specific number of clusters. Furthermore, the number of columns of the data should be larger than the number of clusters, otherwise it raises an error. The clustering will run faster on multi-core machines when OpenMP is enabled (e.g. -fopenmp in GCC). The algorithm is initialized once and 10 iterations are typically sufficient for convergence. The initial centroids are seeded using one of keep_existing, static_subset, random_subset, static_spread and random_spread. If the seed_mode equals keep_existing, then the user should supply a matrix of centroids.
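For example, a KMeans_arma call with user-supplied starting centroids might look as follows (a hedged sketch; here two randomly chosen rows of the scaled data serve as the starting centroids):

init_cent = dat[sample(nrow(dat), 2), , drop = FALSE]    # 2 rows of 'dat' as starting centroids

km_existing = KMeans_arma(dat, clusters = 2, n_iter = 10, seed_mode = "keep_existing",
                          CENTROIDS = init_cent, verbose = FALSE)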

I'll reduce the dimensions of the dietary_survey_IBS data using PCA, and particularly the princomp function of the stats package, so that a 2-dimensional plot of the resulting clusters is possible,

pca_dat = stats::princomp(dat)$scores[, 1:2]

km = KMeans_arma(pca_dat, clusters = 2, n_iter = 10, seed_mode = "random_subset",
                 verbose = T, CENTROIDS = NULL)

pr = predict_KMeans(pca_dat, km)

table(dietary_survey_IBS$class, pr)

class(km) = 'matrix'

plot_2d(data = pca_dat, clusters = as.vector(pr), centroids_medoids = as.matrix(km))





KMeans_rcpp

As stated before, the KMeans_rcpp function offers some additional features in comparison to the KMeans_arma function, such as multiple random initializations (the num_init parameter) and a choice of initialization methods (for instance 'optimal_init', 'quantile_init', 'kmeans++' or 'random').


More details about KMeans_rcpp can be found in the package documentation. I'll explain the various parameters of KMeans_rcpp using a vector quantization example and the OpenImageR package,

library(OpenImageR)

im = readImage('elephant.jpg')

# first resize the image to reduce the dimensions
im = resizeImage(im, 75, 75, method = 'bilinear')

imageShow(im)                        # plot the original image

im2 = apply(im, 3, as.vector)        # vectorize RGB


# perform KMeans_rcpp clustering
km_rc = KMeans_rcpp(im2, clusters = 5, num_init = 5, max_iters = 100,
                    initializer = 'optimal_init', verbose = F)

km_rc$between.SS_DIV_total.SS
## [1] 0.9873009
pr = predict(km_rc, newdata = im2)


The attribute between.SS_DIV_total.SS is equal to (total_SSE - sum(WCSS_per_cluster)) / total_SSE. If there is no pattern of clustering then the between sum of squares will be a very small fraction of the total sum of squares, whereas if the between.SS_DIV_total.SS attribute is close to 1.0 then the observations cluster pretty well.
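This ratio can also be recomputed directly from the fitted object (a small check, assuming the 'total_SSE' and 'WCSS_per_cluster' components returned by KMeans_rcpp):

# recompute the ratio from the fitted object; it should match the reported attribute
(km_rc$total_SSE - sum(km_rc$WCSS_per_cluster)) / km_rc$total_SSE

km_rc$between.SS_DIV_total.SS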


getcent = km_rc$centroids

getclust = km_rc$clusters

new_im = getcent[getclust, ]              # each observation is associated with the nearby centroid

dim(new_im) = c(nrow(im), ncol(im), 3)    # back-conversion to a 3-dimensional image

imageShow(new_im)


As a follow-up, one can take advantage of the Optimal_Clusters_KMeans function (which indirectly uses KMeans_rcpp) to estimate the optimal number of clusters. The available criteria are variance_explained, WCSSE (within-cluster-sum-of-squared-error), dissimilarity, silhouette, distortion_fK, AIC, BIC and Adjusted_Rsquared. More information on each criterion can be found in the package documentation.

In the next code chunk I'll use the distortion_fK criterion, which is fully described in the paper "Selection of K in K-means clustering" (Pham, Dimov, Nguyen, 2004),

opt = Optimal_Clusters_KMeans(im2, max_clusters = 10, plot_clusters = T,
                              criterion = 'distortion_fK', fK_threshold = 0.85,
                              initializer = 'optimal_init', tol_optimal_init = 0.2)


Values below the fixed threshold (here fK_threshold = 0.85) could be recommended for clustering. However, there are multiple optimal clusterings here, which highlights the fact that f(K) should only be used to suggest a guide value for the number of clusters; the final decision as to which value to adopt has to be left to the discretion of the user.
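Under the same assumption that opt holds the plain numeric vector of f(K) values for 1 to 10 clusters, the candidate numbers of clusters can be read off directly:

which(opt < 0.85)          # numbers of clusters whose f(K) falls below the threshold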

Mini-batch-kmeans


Mini-batch-kmeans (http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf) is a variation of the classical k-means algorithm. It is particularly useful for big data sets, because rather than using the whole data set (as k-means does) it uses mini-batches of randomly sampled observations to optimize the objective function.
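The core update can be sketched in a few lines of base R: each iteration draws a small random batch, assigns its points to the nearest centroid and moves every centroid towards its assigned points with a per-centroid learning rate that decays as more points are seen (a toy illustration of the idea, not the package implementation):

minibatch_kmeans_sketch = function(X, k, batch_size = 20, iters = 100) {
  X = as.matrix(X)
  centers = X[sample(nrow(X), k), , drop = FALSE]   # random initial centroids
  counts = rep(0, k)                                # points seen per centroid so far
  for (i in 1:iters) {
    batch = X[sample(nrow(X), batch_size), , drop = FALSE]
    d = as.matrix(dist(rbind(centers, batch)))[-(1:k), 1:k]
    cl = max.col(-d)                                # nearest centroid per batch point
    for (b in 1:batch_size) {
      j = cl[b]
      counts[j] = counts[j] + 1
      eta = 1 / counts[j]                           # per-centroid learning rate
      centers[j, ] = (1 - eta) * centers[j, ] + eta * batch[b, ]
    }
  }
  centers
}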

The parameters of the MiniBatchKmeans algorithm are almost the same as for the KMeans_rcpp function in the ClusterR package. The most important differences are the batch_size (the size of the mini batches) and the init_fraction (the percentage of data to use for the initialization of the centroids, which applies if the initializer equals 'kmeans++' or 'quantile_init').

I'll take advantage of the vector quantization example to show the differences in computation time and output quality between the KMeans_rcpp and MiniBatchKmeans functions,

im_d = readImage('dog.jpg')

# first resize the image to reduce the dimensions
im_d = resizeImage(im_d, 350, 350, method = 'bilinear')

imageShow(im_d)                      # plot the original image

im3 = apply(im_d, 3, as.vector)      # vectorize RGB

dim(im3)                             # initial dimensions of the data
## [1] 122500      3


First, we perform a k-means clustering,

start = Sys.time()

km_init = KMeans_rcpp(im3, clusters = 5, num_init = 5, max_iters = 100,
                      initializer = 'kmeans++', verbose = F)

end = Sys.time()

t = end - start
cat('time to complete :', t, attributes(t)$units, '\n')
## time to complete : 3.333115 secs
getcent_init = km_init$centroids

getclust_init = km_init$clusters

new_im_init = getcent_init[getclust_init, ]        # each observation is associated with the nearby centroid

dim(new_im_init) = c(nrow(im_d), ncol(im_d), 3)    # back-conversion to a 3-dimensional image

imageShow(new_im_init)


and then a mini-batch-kmeans clustering,

start = Sys.time()

km_mb = MiniBatchKmeans(im3, clusters = 5, batch_size = 20, num_init = 5, max_iters = 100,
                        init_fraction = 0.2, initializer = 'kmeans++', early_stop_iter = 10,
                        verbose = F)

pr_mb = predict(object = km_mb, newdata = im3)
## Warning: `predict_MBatchKMeans()` was deprecated in ClusterR 1.3.0.
## ℹ Beginning from version 1.4.0, if the fuzzy parameter is TRUE the function
##   'predict_MBatchKMeans' will return only the probabilities, whereas currently
##   it also returns the hard clusters
## ℹ The deprecated feature was likely used in the ClusterR package.
##   Please report the issue at <https://github.com/mlampros/ClusterR/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
end = Sys.time()

t = end - start
cat('time to complete :', t, attributes(t)$units, '\n')
## time to complete : 1.088886 secs
getcent_mb = km_mb$centroids

new_im_mb = getcent_mb[pr_mb, ]                  # each observation is associated with the nearby centroid

dim(new_im_mb) = c(nrow(im_d), ncol(im_d), 3)    # back-conversion to a 3-dimensional image

imageShow(new_im_mb)


For a slight difference in output quality, the mini-batch-kmeans returns its output, on average, more than twice as fast as the classical k-means.

K-Medoids


The k-medoids algorithm (Kaufman & Rousseeuw, 1987) is a clustering algorithm related to the k-means algorithm and the medoid shift algorithm. Both the k-means and k-medoids algorithms are partitional and both attempt to minimize the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars) and works with an arbitrary metric of distances between data points. A useful tool for determining k is the silhouette width. K-medoids is more robust to noise and outliers in comparison to k-means, because it minimizes a sum of pairwise dissimilarities instead of the sum of squared euclidean distances. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster.
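For instance, given a dissimilarity matrix and a vector of cluster assignments, the medoid of each cluster is simply the member with the smallest average dissimilarity to the other members (a minimal sketch with toy data):

medoid_of = function(members, D) {
  # index of the cluster member with the smallest average dissimilarity
  members[which.min(rowMeans(D[members, members, drop = FALSE]))]
}

set.seed(1)
toy = matrix(rnorm(20), ncol = 2)              # 10 observations
D   = as.matrix(dist(toy))                     # euclidean dissimilarity matrix
cl  = rep(1:2, each = 5)                       # two (arbitrary) clusters
sapply(split(seq_len(nrow(toy)), cl), medoid_of, D = D)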

The most common realization of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm. PAM proceeds in two phases: BUILD and SWAP. In the BUILD phase, the algorithm searches for a good set of initial medoids, and in the SWAP phase all possible swaps between the BUILD medoids and the observations take place, so that there is no further decrease of the objective (Clustering in an Object-Oriented Environment, A. Struyf, M. Hubert, P. Rousseeuw, 1997).

In the ClusterR package, the Cluster_Medoids and Clara_Medoids functions correspond to the PAM (partitioning around medoids) and CLARA (clustering large applications) algorithms.

In the following code chunk, I'll make use of the mushroom data to illustrate how k-medoids work with a distance metric other than the euclidean distance. The mushroom data consist of 23 categorical attributes (including the class) and 8124 instances. More information about the data can be found in the package documentation.

Cluster_Medoids

The Cluster_Medoids function can also take - besides a matrix or data frame - a dissimilarity matrix as input. In the case of the mushroom data, where all the features are categorical (with two or more unique values), it would be meaningful to use the gower distance. The gower distance applies a different function to each predictor depending on its type (numeric, ordered, factor). This dissimilarity measure is implemented in many R packages, among others in the cluster package (daisy function) and in the FD package (gowdis function). I'll take advantage of the gowdis function of the FD package, as it also allows user-defined weights for each separate predictor,

data(mushroom)

X = mushroom[, -1]

y = as.numeric(mushroom[, 1])       # convert the labels to numeric

gwd = FD::gowdis(X)                 # calculate the 'gower' distance for the factor variables

gwd_mat = as.matrix(gwd)            # convert the distances to a matrix

cm = Cluster_Medoids(gwd_mat, clusters = 2, swap_phase = TRUE, verbose = F)
## Warning: The `seed` argument of `Cluster_Medoids()` is deprecated as of ClusterR 1.2.6.
## ℹ The 'seed' parameter will be removed in version 1.4.0
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.


Non-Weighted-K-medoids

  adjusted_rand_index    avg_silhouette_width
  0.5733587              0.2545221


As mentioned before, the gowdis function of the FD package allows the user to give different weights to each separate variable. The weights parameter can be tuned, for example by using random search, in order to achieve better clustering results (a short random-search sketch is given after the results table below). For instance, by using the following weights for each separate variable one can improve both the adjusted-rand-index (external validation) and the average silhouette width (internal validation),

predictors                    weights
cap_shape                       4.626
cap_surface                    38.323
cap_color                      55.899
bruises                        34.028
odor                          169.608
gill_attachment                 6.643
gill_spacing                   42.080
gill_size                      57.366
gill_color                     37.938
stalk_shape                    33.081
stalk_root                     65.105
stalk_surface_above_ring       18.718
stalk_surface_below_ring       76.165
stalk_color_above_ring         27.596
stalk_color_below_ring         26.238
veil_type                       0.000
veil_color                      1.507
ring_number                    37.314
ring_type                      32.685
spore_print_color             127.870
population                     64.019
habitat                        44.519


weights = c(4.626, 38.323, 55.899, 34.028, 169.608, 6.643, 42.08, 57.366, 37.938, 33.081,
            65.105, 18.718, 76.165, 27.596, 26.238, 0.0, 1.507, 37.314, 32.685, 127.87,
            64.019, 44.519)

gwd_w = FD::gowdis(X, w = weights)       # 'gower' distance using weights

gwd_mat_w = as.matrix(gwd_w)             # convert the distances to a matrix

cm_w = Cluster_Medoids(gwd_mat_w, clusters = 2, swap_phase = TRUE, verbose = F)


Weighted-K-medoids

  adjusted_rand_index    avg_silhouette_width
  0.6197672              0.3000048
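As mentioned above, the weights do not have to be hand-picked; a simple random search can be used to look for a better weight vector. The following is an illustrative, hedged sketch: the 'clusters' component of the Cluster_Medoids output is assumed to hold the hard cluster assignments, and every iteration recomputes the full gower distance matrix of the 8124 instances, so the loop is slow.

set.seed(1)

best = list(score = -Inf, weights = NULL)

for (i in 1:20) {                                        # 20 random weight vectors
  w_i   = runif(ncol(X), min = 0, max = 200)             # random weights for the 22 predictors
  d_i   = as.matrix(FD::gowdis(X, w = w_i))              # weighted 'gower' distances
  cm_i  = Cluster_Medoids(d_i, clusters = 2, swap_phase = TRUE, verbose = F)
  score = external_validation(y, cm_i$clusters, method = "adjusted_rand_index")
  if (score > best$score) best = list(score = score, weights = w_i)
}

best$score                                               # best adjusted-rand-index found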



Clara_Medoids

CLARA (CLustering LARge Applications) is an obvious way to cluster larger datasets. Instead of finding medoids for the entire data set - it would also be infeasible to calculate the dissimilarity matrix - CLARA draws a small sample from the data and applies the PAM algorithm to generate an optimal set of medoids for the sample. The quality of the resulting medoids is measured by the average dissimilarity between every object in the entire data set and the medoid of its cluster.

The Clara_Medoids function in the ClusterR package follows the same logic by applying the Cluster_Medoids function to each selected sample. Clara_Medoids takes two additional parameters, samples and sample_size. The first one indicates the number of samples to draw from the data set, while the second one the fraction of the data to draw in each sample iteration (a float number between 0.0 and 1.0). I have to point out that the Clara_Medoids function does not take a dissimilarity matrix as input, as the Cluster_Medoids function does.

I'll apply the Clara_Medoids function to the previously used mushroom data set by using the hamming distance as a dissimilarity metric, and I'll compare the system time and results with those of the Cluster_Medoids function. The hamming distance is appropriate for the mushroom data as it's applicable to discrete variables and it's defined as the number of attributes that take different values for two compared instances (Data Mining Algorithms: Explained using R, Pawel Cichosz, 2015, page 318).
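As a quick illustration of the metric itself (not the package internals), the hamming distance between two encoded instances is just the count of differing attributes:

# toy illustration of the hamming distance between two encoded instances
hamming = function(a, b) sum(a != b)

hamming(c(1, 2, 2, 3), c(1, 2, 1, 1))        # two attributes differ, so the distance is 2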

cl_X = X        # copy initial data

# the Clara_Medoids function allows only numeric attributes
# so first convert to numeric
for (i in 1:ncol(cl_X)) { cl_X[, i] = as.numeric(cl_X[, i]) }

start = Sys.time()

cl_f = Clara_Medoids(cl_X, clusters = 2, distance_metric = 'hamming', samples = 5,
                     sample_size = 0.2, swap_phase = TRUE, verbose = F, threads = 1)

end = Sys.time()

t = end - start
cat('time to complete :', t, attributes(t)$units, '\n')
## time to complete : 3.565772 secs


hamming-Clara-Medoids

  adjusted_rand_index    avg_silhouette_width
  0.5733587              0.2436067



start = Sys.time()

cl_e = Cluster_Medoids(cl_X, clusters = 2, distance_metric = 'hamming', swap_phase = TRUE,
                       verbose = F, threads = 1)

end = Sys.time()

t = end - start
cat('time to complete :', t, attributes(t)$units, '\n')
## time to complete : 16.3876 secs


hamming-Cluster-Medoids

  adjusted_rand_index    avg_silhouette_width
  0.5733587              0.2545221



Using the hamming distance, both the Clara_Medoids and the Cluster_Medoids functions return approximately the same result (comparable also with the gower distance results), but the Clara_Medoids function runs more than four times faster than Cluster_Medoids for this particular data set.

By using the resulting objects of the last two code chunks, one can also plot the silhouette widths using the Silhouette_Dissimilarity_Plot function. Worth mentioning here is that the dissimilarities and silhouette widths of the Clara_Medoids function are based on the best-selected sample and not on the entire data set, as is the case for the Cluster_Medoids function.

# Silhouette Plot for the "Clara_Medoids" object
Silhouette_Dissimilarity_Plot(cl_f, silhouette = TRUE)
## [1] TRUE





# Silhouette Plot for the "Cluster_Medoids" object
Silhouette_Dissimilarity_Plot(cl_e, silhouette = TRUE)
## [1] TRUE





