Silhouette (clustering)

Quality measure in cluster analysis

Silhouette is a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified.[1] It was proposed by Belgian statistician Peter Rousseeuw in 1987.

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette value ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. A clustering with an average silhouette width of over 0.7 is considered to be "strong", a value over 0.5 "reasonable", and over 0.25 "weak". However, with an increasing dimensionality of the data, it becomes difficult to achieve such high values because of the curse of dimensionality, as the distances become more similar.[2] The silhouette score is specialized for measuring cluster quality when the clusters are convex-shaped, and may not perform well if the data clusters have irregular shapes or are of varying sizes.[3] The silhouette value can be calculated with any distance metric, such as Euclidean distance or Manhattan distance.
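
As a concrete illustration, the following is a minimal sketch using scikit-learn (the two-blob toy data and all parameter values are arbitrary choices for illustration), showing that the score can be computed under different distance metrics:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: two well-separated Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

# Cluster with k-means, then score the labeling.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The distance metric is interchangeable; both values will be close to 1
# here because the blobs are compact and far apart.
print(silhouette_score(X, labels, metric="euclidean"))
print(silhouette_score(X, labels, metric="manhattan"))
```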

Definition

A plot showing silhouette scores from three types of animals from the Zoo dataset, as rendered by the Orange data mining suite. At the bottom of the plot, silhouette identifies dolphin and porpoise as outliers in the group of mammals.

Assume the data have been clustered via any technique, such as k-medoids or k-means, into $k$ clusters.

For data point $i \in C_i$ (data point $i$ in the cluster $C_i$), calculate $a(i)$, the mean distance between $i$ and all other points in that cluster:

$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, i \neq j} d(i, j)$$

where $|C_i|$ is the number of points belonging to cluster $C_i$, and $d(i, j)$ is the distance between data points $i$ and $j$ in the cluster $C_i$ (we divide by $|C_i| - 1$ because the distance $d(i, i)$ is not included in the sum). $a(i)$ can be interpreted as a measure of how well $i$ is assigned to its cluster (the smaller the value, the better the assignment).

We then define the mean dissimilarity of point $i$ to some cluster $C_j$ as the mean of the distance from $i$ to all points in $C_j$ (where $C_j \neq C_i$).

For each data point $i \in C_i$, we now define $b(i)$ as the average distance between $i$ and the points of the closest cluster (hence the "min" in the formula below) that $i$ does not belong to:

$$b(i) = \min_{j \neq i} \frac{1}{|C_j|} \sum_{l \in C_j} d(i, l)$$

The cluster with this smallest mean dissimilarity is said to be the "neighboring cluster" of $i$ because it is the next-best-fitting cluster for point $i$.

We now define a silhouette (value) of one data point $i$:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \quad \text{if } |C_i| > 1,$$

and

$$s(i) = 0, \quad \text{if } |C_i| = 1,$$

which can also be written as

$$s(i) = \begin{cases} 1 - \dfrac{a(i)}{b(i)}, & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ \dfrac{b(i)}{a(i)} - 1, & \text{if } a(i) > b(i) \end{cases}$$

From the above definition, $s(i)$ is bounded to the interval $[-1, 1]$, i.e. $-1 \leq s(i) \leq 1$.

Note that $a(i)$ is not clearly defined for clusters of size 1, in which case we set $s(i) = 0$. This choice is arbitrary, but neutral in the sense that it lies at the midpoint of the bounds, −1 and 1.[1]
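
The definitions above translate directly into code. Below is a minimal NumPy sketch (the function name `silhouette_values` is ours; it assumes a precomputed pairwise distance matrix `D` and an integer label array), including the $s(i) = 0$ convention for singleton clusters:

```python
import numpy as np

def silhouette_values(D, labels):
    """Per-point silhouette s(i) from a pairwise distance matrix D."""
    n = len(labels)
    clusters = np.unique(labels)
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        # a(i): mean distance to the other members of i's own cluster;
        # dividing by (|C_i| - 1) excludes d(i, i) = 0 from the mean.
        a = D[i, own].sum() / (own.sum() - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```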

For $s(i)$ to be close to 1 we require $a(i) \ll b(i)$. As $a(i)$ is a measure of how dissimilar $i$ is to its own cluster, a small value means it is well matched. Furthermore, a large $b(i)$ implies that $i$ is badly matched to its neighbouring cluster. Thus an $s(i)$ close to 1 means that the data point is appropriately clustered. If $s(i)$ is close to −1, then by the same logic we see that $i$ would be more appropriately placed in its neighbouring cluster. An $s(i)$ near zero means that the datum is on the border of two natural clusters.

The mean $s(i)$ over all points of a cluster is a measure of how tightly grouped all the points in the cluster are. Thus the mean $s(i)$ over all data of the entire dataset is a measure of how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of $k$ is used in the clustering algorithm (e.g., k-means), some of the clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and means may be used to determine the natural number of clusters within a dataset. One can also increase the likelihood of the silhouette being maximized at the correct number of clusters by re-scaling the data using feature weights that are cluster specific.[4]

Kaufman et al. introduced the term silhouette coefficient for the maximum value of the mean $s(i)$ over all data of the entire dataset,[5] i.e.,

$$SC = \max_k \tilde{s}(k),$$

where $\tilde{s}(k)$ represents the mean $s(i)$ over all data of the entire dataset for a specific number of clusters $k$. The silhouette coefficient describes the best possible clustering for a given number of clusters, as measured by the highest average silhouette score over all points in the dataset.
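
Because searching over all possible clusterings is infeasible, in practice one typically approximates $\tilde{s}(k)$ by running a standard algorithm such as k-means for each candidate $k$ and keeping the best mean silhouette. A hedged sketch along these lines (the helper name and the range of $k$ values tried are our choices):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_coefficient(X, k_values=range(2, 11)):
    """Approximate SC: best mean silhouette over k-means runs for each k."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)  # k maximizing the mean silhouette
    return best_k, scores[best_k]
```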

Simplified and medoid silhouette


Computing the silhouette coefficient needs all $\mathcal{O}(N^2)$ pairwise distances, making this evaluation much more costly than clustering with k-means. For a clustering with centers $\mu_{C_I}$ for each cluster $C_I$, we can instead use the following simplified silhouette for each point $i \in C_I$, which can be computed using only $\mathcal{O}(Nk)$ distances:

$$a'(i) = d(i, \mu_{C_I}) \quad \text{and} \quad b'(i) = \min_{C_J \neq C_I} d(i, \mu_{C_J}),$$

which has the additional benefit that $a'(i)$ is always defined. We then define accordingly the simplified silhouette and the simplified silhouette coefficient:[6]

$$s'(i) = \frac{b'(i) - a'(i)}{\max\{a'(i), b'(i)\}}$$
$$SC' = \max_k \frac{1}{N} \sum_i s'(i).$$

If the cluster centers are medoids (as in k-medoids clustering) instead of arithmetic means (as in k-means clustering), this is also called the medoid-based silhouette[7] or medoid silhouette.[8]

If every object is assigned to the nearest medoid (as in k-medoids clustering), we know that $a'(i) \leq b'(i)$, and hence $s'(i) = \frac{b'(i) - a'(i)}{b'(i)} = 1 - \frac{a'(i)}{b'(i)}$.[8]
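
A minimal NumPy sketch of the simplified silhouette (function and variable names are ours; the centers may be k-means centroids or medoids), where the $\mathcal{O}(Nk)$ cost is visible in the single point-to-center distance computation:

```python
import numpy as np

def simplified_silhouette(X, centers, labels):
    """Simplified silhouette s'(i) from O(Nk) point-to-center distances."""
    n = len(X)
    # dist[i, J] = d(i, mu_{C_J}); shape (N, k), Euclidean for illustration.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    a = dist[np.arange(n), labels]        # a'(i): distance to own center
    other = dist.copy()
    other[np.arange(n), labels] = np.inf  # mask the own cluster's center
    b = other.min(axis=1)                 # b'(i): nearest other center
    return (b - a) / np.maximum(a, b)
```

If `labels` assigns every point to its nearest center, then $a'(i) \leq b'(i)$ and the result simplifies to $1 - a'(i)/b'(i)$, as noted above.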

Silhouette clustering


Instead of using the average silhouette to evaluate a clustering obtained from, e.g., k-medoids or k-means, we can try to directly find a solution that maximizes the silhouette. We do not have a closed-form solution to maximize this, but it will usually be best to assign points to the nearest cluster as done by these methods. Van der Laan et al.[7] proposed to adapt the standard algorithm for k-medoids, PAM, for this purpose and call this algorithm PAMSIL (a code sketch follows the steps below):

  1. Choose initial medoids by using PAM
  2. Compute the average silhouette of this initial solution
  3. For each pair of a medoid m and a non-medoid x:
    1. swap m and x
    2. compute the average silhouette of the resulting solution
    3. remember the best swap
    4. un-swap m and x for the next iteration
  4. If the best swap found improves the average silhouette, perform it and return to step 3; otherwise stop.
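
The following Python sketch mirrors these steps (names are ours; for brevity, the PAM initialization is replaced by caller-supplied initial medoids, and the average silhouette is recomputed from scratch for every candidate swap, which is what gives the cost discussed below):

```python
import numpy as np

def mean_silhouette(D, labels):
    """Average silhouette over all points, from a distance matrix D."""
    total = 0.0
    for i in range(len(labels)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster contributes s(i) = 0
        a = D[i, own].sum() / (own.sum() - 1)
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        total += (b - a) / max(a, b)
    return total / len(labels)

def pamsil(D, init_medoids, max_iter=50):
    """Greedy swap search maximizing the average silhouette (PAMSIL-style)."""
    medoids = list(init_medoids)
    assign = lambda m: np.argmin(D[:, m], axis=1)  # nearest-medoid labels
    best = mean_silhouette(D, assign(medoids))
    for _ in range(max_iter):
        best_swap = None
        for pos in range(len(medoids)):      # each medoid m ...
            for x in range(len(D)):          # ... against each non-medoid x
                if x in medoids:
                    continue
                trial = medoids[:pos] + [x] + medoids[pos + 1:]
                score = mean_silhouette(D, assign(trial))
                if score > best:             # remember the best swap
                    best, best_swap = score, (pos, x)
        if best_swap is None:                # no improvement: stop
            break
        medoids[best_swap[0]] = best_swap[1] # perform the best swap
    return medoids, best
```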

The loop in step 3 is executed for $\mathcal{O}(Nk)$ pairs, and involves computing the silhouette in $\mathcal{O}(N^2)$, hence this algorithm needs $\mathcal{O}(N^3 k i)$ time, where $i$ is the number of iterations.

Because this is a fairly expensive operation, the authors propose to also use the medoid-based silhouette, and call the resulting algorithm PAMMEDSIL.[7] It needs $\mathcal{O}(N^2 k^2 i)$ time.

Batool et al. propose a similar algorithm under the name OSil, together with a CLARA-like sampling strategy for larger data sets that solves the problem only for a sub-sample.[9]

By adopting recent improvements to the PAM algorithm, FastMSC reduces the runtime using the medoid silhouette to just $\mathcal{O}(N^2 i)$.[8]

By starting with a maximum number of clusters $k_{\max}$ and iteratively removing the worst center (in terms of a change in silhouette) and re-optimizing, the best (highest medoid silhouette) clustering can be automatically determined. As data structures can be reused, this reduces the computation cost substantially over repeatedly running the algorithm for different numbers of clusters.[10] This algorithm needs pairwise distances and is typically implemented with a pairwise distance matrix. The $\mathcal{O}(N^2)$ memory requirement is the main limiting factor for applying this to very large data sets.
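
A deliberately simplified sketch of this descent over $k$ (assuming every point is assigned to its nearest medoid, so that $s'(i) = 1 - a'(i)/b'(i)$ as above; the re-optimization after each removal and the data-structure reuse that make the published method fast are omitted here):

```python
import numpy as np

def mean_medoid_silhouette(D, medoids):
    """Average medoid silhouette using the two nearest medoids per point."""
    dist = D[:, medoids]                        # shape (N, k), k >= 2
    two = np.partition(dist, 1, axis=1)[:, :2]  # two smallest per row
    a, b = two[:, 0], two[:, 1]
    return np.mean(1.0 - a / np.where(b > 0, b, 1.0))

def shrink_k(D, init_medoids, k_min=2):
    """Drop the center whose removal hurts least; return the best k found."""
    medoids = list(init_medoids)                # start from k_max medoids
    best_medoids = medoids[:]
    best_score = mean_medoid_silhouette(D, medoids)
    while len(medoids) > k_min:
        # Evaluate the silhouette after each candidate removal.
        scores = [mean_medoid_silhouette(D, medoids[:j] + medoids[j + 1:])
                  for j in range(len(medoids))]
        j = int(np.argmax(scores))              # worst center = best removal
        medoids.pop(j)
        if scores[j] > best_score:
            best_medoids, best_score = medoids[:], scores[j]
    return best_medoids, best_score
```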

Density-based Silhouette coefficient: the DBCV index


The traditional silhouette score can be misleading when used to assess nested, concave-shaped clusters. For these contexts, a density-based version of the silhouette score, called the DBCV index, was proposed in 2014 by David Moulavi and colleagues.[11]

References

  1. Rousseeuw, Peter J. (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Computational and Applied Mathematics. 20: 53–65. doi:10.1016/0377-0427(87)90125-7.
  2. Beyer, Kevin; Goldstein, Jonathan; Ramakrishnan, Raghu; Shaft, Uri (1999). "When is "nearest neighbor" meaningful?". Database Theory—ICDT'99: 7th International Conference, Jerusalem, Israel, January 10–12, 1999, Proceedings.
  3. Monshizadeh, Mehrnoosh; Khatri, Vikramajeet; Kantola, Raimo; Yan, Zheng (2022-11-01). "A deep density based and self-determining clustering approach to label unknown traffic". Journal of Network and Computer Applications. 207: 103513. doi:10.1016/j.jnca.2022.103513. ISSN 1084-8045. "However, both measures [silhouette coefficient and edge correlation] prefer convex-shaped clusters and cannot adapt to all cluster shapes produced by DBSCAN."
  4. de Amorim, R.C.; Hennig, C. (2015). "Recovering the number of clusters in data sets with noise features using feature rescaling factors". Information Sciences. 324: 126–145. arXiv:1602.06989. doi:10.1016/j.ins.2015.06.039. S2CID 315803.
  5. Kaufman, Leonard; Rousseeuw, Peter J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, NJ: Wiley-Interscience. p. 87. doi:10.1002/9780470316801. ISBN 9780471878766.
  6. Hruschka, E.R.; de Castro, L.N.; Campello, R.J.G.B. (2004). "Evolutionary Algorithms for Clustering Gene-Expression Data". Fourth IEEE International Conference on Data Mining (ICDM'04). IEEE. pp. 403–406. doi:10.1109/ICDM.2004.10073.
  7. Van der Laan, Mark; Pollard, Katherine; Bryan, Jennifer (2003). "A new partitioning around medoids algorithm". Journal of Statistical Computation and Simulation. 73 (8): 575–584. doi:10.1080/0094965031000136012. ISSN 0094-9655. S2CID 17437463.
  8. Lenssen, Lars; Schubert, Erich (2022). "Clustering by Direct Optimization of the Medoid Silhouette". International Conference on Similarity Search and Applications. pp. 190–204. arXiv:2209.12553. doi:10.1007/978-3-031-17849-8_15.
  9. Batool, Fatima; Hennig, Christian (2021). "Clustering with the Average Silhouette Width". Computational Statistics & Data Analysis. 158: 107190. arXiv:1910.11339. doi:10.1016/j.csda.2021.107190. S2CID 219260336.
  10. Lenssen, Lars; Schubert, Erich (2024-02-01). "Medoid Silhouette clustering with automatic cluster number selection". Information Systems. 120: 102290. arXiv:2309.03751. doi:10.1016/j.is.2023.102290. ISSN 0306-4379.
  11. Moulavi, David; Jaskowiak, Pablo A.; Campello, Ricardo J. G. B.; Zimek, Arthur; Sander, Jörg (2014). "Density-Based Clustering Validation". Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM. pp. 839–847. doi:10.1137/1.9781611973440.96. ISBN 978-1-61197-344-0.