
The package FCPS provides standardized access to state-of-the-art clustering algorithms, datasets defining common clustering challenges, and the estimation of the number of clusters. Additionally, the cluster tendency can be investigated and an appropriate accuracy can be computed for arbitrary labels.

In the following example, the high-dimensional leukemia dataset is loaded and visualized:

For datasets with more than three dimensions, the function ClusterPlotMDS used for the figure above provides a 3D projection using multidimensional scaling from the R package smacof on CRAN. The user can decide whether the rgl package or the plotly package is used for plotting.

Cluster Analysis with FCPS

In the following code a high-dimensional dataset is loaded. The leukemia dataset provides a distance matrix instead of a data matrix. Without further adjustments, the function AgglomerativeNestingClustering can be called with the correct number of clusters. The resulting clustering is stored in the list element Cls, which is a named vector. The names are defined by the rows of the distance or data matrix. In the next step, the user can rename the clusters to consecutive numbers from 1 to 6, with 1 being the label of the largest cluster and 6 being the label of the smallest. The names still match the row names of the data or distances. Besides the Cls element, the output list CA stores the original object of the clustering. For hierarchical algorithms, another list element stores the dendrogram, which can be visualized with ClusterDendrogram as shown in the next section.

    library(FCPS)
    data('Leukemia')
    set.seed(123)
    ClusterNo = length(unique(Leukemia$Cls))
    CA = AgglomerativeNestingClustering(Leukemia$DistanceMatrix, ClusterNo)
    Cls = ClusterRenameDescendingSize(CA$Cls)
    sum(match(names(Cls), rownames(Leukemia$DistanceMatrix), nomatch = 0) == 0)
    #> [1] 0
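The renaming step described above can be illustrated with a minimal base R sketch. The helper `rename_descending_size` and the toy labels are hypothetical and only mimic the described behavior (largest cluster becomes 1, names are preserved); this is not the FCPS implementation of ClusterRenameDescendingSize.

```r
# Hypothetical helper, not part of FCPS: relabel clusters so that
# label 1 is the largest cluster, label 2 the second largest, and so on.
rename_descending_size <- function(cls) {
  sizes <- sort(table(cls), decreasing = TRUE)   # cluster sizes, largest first
  new_labels <- seq_along(sizes)                 # 1, 2, ..., k
  names(new_labels) <- names(sizes)              # map old label -> new label
  out <- unname(new_labels[as.character(cls)])   # translate every element
  names(out) <- names(cls)                       # keep the original case names
  out
}

# Toy labels with named cases; cluster 7 is the largest, 3 the smallest
cls <- c(a = 7, b = 3, c = 7, d = 5, e = 7, f = 5)
renamed <- rename_descending_size(cls)
print(renamed)
```

Because the case names are copied over unchanged, a check like the `match()` call above still finds every name after renaming.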

Generating a Clustering Challenge

Any clustering challenge listed in Table 2 can be generated with an arbitrary sample size. Here, Chainlink is selected and visualized in the figure below.

    set.seed(600)
    library(FCPS)
    DataList = ClusterChallenge("Chainlink", SampleSize = 750)
    Data = DataList$Chainlink
    Cls = DataList$Cls
    ClusterPlotMDS(Data, Cls, Plotter3D = 'plotly', main = "Chainlink")

    ClusterCount(Cls)
    #> $UniqueClusters
    #> [1] 1 2
    #>
    #> $CountPerCluster
    #>   1   2
    #> 376 374
    #>
    #> $NumberOfClusters
    #> [1] 2
    #>
    #> $ClusterPercentages
    #> [1] 50.13333 49.86667

Remark: ClusterPlotMDS detects that the dataset has only three dimensions and, instead of projecting the data, visualizes the three given dimensions.

Testing for Cluster Tendency

The cluster tendency, also called clusterability, can be investigated with ggplot2 syntax as follows for the Chainlink example:

    set.seed(600)
    library(FCPS)
    DataList = ClusterChallenge("Chainlink", SampleSize = 750)
    Data = DataList$Chainlink
    Cls = DataList$Cls
    library(ggplot2)
    ClusterabilityMDplot(Data) + theme_bw()
    #> [1] "This MD-plot is typically for several features at once. By calling as.matrix(), it will be now used with one feature."
    #> NULL

The figure presents the result for the sample of Chainlink. The MD-plot shows multimodality, and the statistical test agrees with the MD-plot: the hypothesis that the data has no cluster tendency is rejected with p < 0.01. Therefore, the sample has a high cluster tendency.

Estimating the Number of Clusters

Let's assume that no prior knowledge about the Chainlink data is available and that the hierarchical algorithm single linkage is selected. Looking at the dendrogram, the largest change in fusion level indicates two clusters. However, each of the main clusters may have two subclusters, as presented in Figure 3. Therefore, the function ClusterNoEstimation is used to investigate this assumption. Figure 4 presents the fan plot of the number of indicators preferring a specific number of clusters for the sample of the Chainlink dataset. The majority vote proposes 7 or 3 clusters, with the correct number of clusters, 2, in second place. The appropriate number of clusters would be two because neither 7 nor 3 is present in the dendrogram.

The following code uses a numerical data matrix for the hierarchical clustering algorithm. If not set otherwise, the Euclidean distances are computed internally with the parallelDist package and then used. Furthermore, the fastcluster package is used to compute the tree. The dendextend package allows user-specific coloring of the branches if ClusterDendrogram is used. The function ClusterNoEstimation expects a matrix of clusterings, each column being one "Cls", ordered by the range of cluster numbers of interest. In this example, the range from 2 to 7 is investigated.

    library(FCPS)
    set.seed(135)
    DataList = ClusterChallenge("Chainlink", SampleSize = 900)
    Data = DataList$Chainlink
    Cls = DataList$Cls
    Tree = HierarchicalClustering(Data, 1, "SingleL")[[3]]
    ClusterDendrogram(Tree, 4, main = 'Single Linkage')
    MaximumNunber = 7
    clsm <- matrix(data = 0, nrow = dim(Data)[1], ncol = MaximumNunber)
    for (i in 2:(MaximumNunber+1)) {
      clsm[, i-1] <- cutree(Tree, i)
    }
    out = ClusterNoEstimation(Data, ClsMatrix = clsm, MaxClusterNo = MaximumNunber, PlotIt = TRUE)
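The idea of reading the number of clusters from the largest change in fusion level can be sketched with base R alone. The toy data, the variable names and the use of `stats::hclust` are assumptions for illustration; FCPS itself relies on fastcluster and parallelDist as described above, and this is not how ClusterNoEstimation computes its indicators.

```r
set.seed(1)
# Two well-separated toy groups in the plane (assumed data, not Chainlink)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
tree <- hclust(dist(x), method = "single")

# tree$height holds the n - 1 fusion levels in increasing order.
# Cutting inside the gap between heights[j] and heights[j + 1]
# leaves n - j clusters, where n = length(heights) + 1.
heights <- tree$height
gaps <- diff(heights)
j <- which.max(gaps)
k_suggested <- length(heights) + 1 - j
print(k_suggested)

# One column per candidate number of clusters, here 2 to 7
cls_matrix <- sapply(2:7, function(k) cutree(tree, k = k))
print(dim(cls_matrix))
```

The largest gap is only a heuristic. A matrix built like `cls_matrix` has the same shape of input, one clustering per column, that the text describes for ClusterNoEstimation.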

Accurate Comparison to a Given Ground Truth

Usually, clustering accuracy is either computed correctly only for binary classifications or is computed per cluster as shown below. The latter does not allow for a straightforward comparison between algorithms. Packages often provide only a simple computation of the overall accuracy. The following code outlines why the overall accuracy is not correct if computed in this straightforward way. The solution is provided by the function ClusterAccuracy, which calculates the correct accuracy of a clustering algorithm:

    library(FCPS)
    data("Leukemia")
    Distance = Leukemia$DistanceMatrix
    Classification = Leukemia$Cls
    Cls = HierarchicalClustering(Distance, 6, "SingleL")$Cls
    # Usual computation: accuracy per class
    cm = as.matrix(table(Cls, Classification))
    diag(cm)/rowSums(cm)
    #>          1          2          3          4          5          6
    #> 0.06666667 0.00000000 1.00000000 1.00000000 1.00000000 1.00000000
    # Usual overall accuracy
    sum(diag(cm))/sum(cm)
    #> [1] 0.7797834
    # e.g.
    # MLmetrics::Accuracy(Cls, Classification)
    # Correct computation
    ClusterAccuracy(Cls, Classification)
    #> [1] 0.9963899
    cm
    #>    Classification
    #> Cls   1   2   3   4   5   6
    #>   1   1  14   0   0   0   0
    #>   2   0   0 108   0   0   0
    #>   3   0   0   1   0   0   0
    #>   4   0   0   0 266   0   0
    #>   5   0   0   0   0 163   0
    #>   6   0   0   0   0   0   1
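The reason the diagonal of the contingency table is misleading is that cluster labels are arbitrary: a clustering can reproduce the true partition while using different label numbers. One common remedy is to score the clustering under the best one-to-one relabeling. The sketch below does this by brute force over all permutations, which is only feasible for a small number of clusters. The helpers `all_permutations` and `best_match_accuracy` are hypothetical, and whether ClusterAccuracy uses this exact procedure is not stated in the text.

```r
# Hypothetical helpers, not the FCPS implementation.
# Enumerate all permutations of a vector recursively.
all_permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in all_permutations(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

# Accuracy under the best one-to-one matching of cluster labels to classes
best_match_accuracy <- function(cls, truth) {
  labels <- sort(union(unique(cls), unique(truth)))
  # Square contingency table: rows = cluster labels, columns = true classes
  cm <- table(factor(cls, levels = labels), factor(truth, levels = labels))
  best <- 0
  for (p in all_permutations(seq_along(labels))) {
    # Row i is matched to column p[i]; count the agreeing cases
    hits <- sum(cm[cbind(seq_along(labels), p)])
    best <- max(best, hits)
  }
  best / length(cls)
}

# Nearly the same partition, but the two label names are swapped
truth <- c(1, 1, 1, 2, 2, 2)
cls   <- c(2, 2, 2, 1, 1, 2)
naive <- mean(cls == truth)               # penalizes the arbitrary label names
fixed <- best_match_accuracy(cls, truth)  # compares the partitions themselves
print(c(naive = naive, fixed = fixed))
```

In this toy case only one of six cases is placed in the wrong group, yet the naive computation counts five errors because it takes the label numbers literally.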

Further Functionality

This section outlines further functionalities focussing on the possibilities in cases for which the definitions in Table 3 are not met.
The user can transform factors to numerical vectors using the function ClusterCreateClassification and perform simple cluster-based evaluations per column of data or use the created Cls otherwise. In the example below, the mean per cluster and feature is computed with ClusterApply:

    library(datasets)
    library(FCPS)
    Iris = datasets::iris
    Data = as.matrix(Iris[, 1:4])
    SomeFactors = Iris$Species
    V = ClusterCreateClassification(SomeFactors)
    Cls = V$Cls
    V$ClusterNames
    #>            1            2            3
    #>     "setosa" "versicolor"  "virginica"
    ClusterApply(Data, mean, Cls)
    #> $UniqueClusters
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width
    #> 1        5.006       3.428        1.462       0.246
    #> 2        5.936       2.770        4.260       1.326
    #> 3        6.588       2.974        5.552       2.026
    #>
    #> $meanPerCluster
    #> [1] "1" "2" "3"
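For comparison, the same two steps, turning a factor into numeric labels and computing the mean per cluster and feature, can be sketched in base R. This is an illustrative equivalent under the assumption that the labels follow the factor level order; it does not show how ClusterCreateClassification or ClusterApply work internally.

```r
data(iris)
cls <- as.integer(iris$Species)          # factor levels mapped to 1, 2, 3
cluster_names <- levels(iris$Species)    # names belonging to labels 1, 2, 3

# Mean of every numeric feature within each cluster
means <- aggregate(iris[, 1:4], by = list(Cluster = cls), FUN = mean)
print(cluster_names)
print(means)
```

The resulting per-cluster means can be checked against the table printed by ClusterApply above.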

The same computations are possible for distance matrices. The function ClusterApply can also be used to transform distances to data. As an example, the tetragonula dataset is loaded from the prabclus package. The dataset consists of a data frame with 236 cases and 13 string features. The computed distance matrix can be used in this package directly or, via MDS transformation, as a numerical matrix:

    suppressPackageStartupMessages(library('prabclus', quietly = TRUE))
    data(tetragonula)
    # Generate a specific distance matrix
    ta <- alleleconvert(strmatrix = as.matrix(tetragonula[1:236, ]))
    tai <- alleleinit(allelematrix = ta, distance = "none")
    Distances = alleledist((unbuild.charmatrix(tai$charmatrix, 236, 13)), 236, 13)
    Cls = rep(1, nrow(Distances))
    DataTrans = ClusterApply(Distances, identity, Cls)$identityPerCluster
    dim(DataTrans)
    #> NULL
    dim(Distances)
    #> [1] 236 236
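What an MDS transformation from distances to a numerical matrix looks like can be sketched with classical MDS from base R. The toy data and the use of `stats::cmdscale` are assumptions for illustration; the MDS variant used inside FCPS may differ (the text mentions smacof for ClusterPlotMDS).

```r
set.seed(2)
x <- matrix(rnorm(30), ncol = 3)   # 10 toy cases with 3 features
d <- as.matrix(dist(x))            # symmetric 10 x 10 distance matrix

# Classical MDS: embed the cases in k dimensions from the distances alone
coords <- cmdscale(d, k = 3)
print(dim(coords))

# For Euclidean input the embedding reproduces the distances up to rounding
max_error <- max(abs(as.matrix(dist(coords)) - d))
print(max_error)
```

For non-Euclidean dissimilarities such as genetic distances, the embedding is in general only an approximation, so the reconstruction error would not vanish.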

Summary

Fifty-four conventional clustering algorithms are provided in the R package FCPS on CRAN with consistent input and output. This enables the user to try out many algorithms swiftly. Additionally, 26 statistical approaches for the estimation of the number of clusters, as well as the mirrored density plot (MD-plot) of clusterability, are provided. Moreover, the fundamental clustering problems suite (FCPS) offers a variety of clustering challenges any algorithm should handle when facing real-world data.