CURE algorithm

From Wikipedia, the free encyclopedia
Data clustering algorithm

CURE (Clustering Using REpresentatives) is an efficient data clustering algorithm for large databases[citation needed]. Compared with K-means clustering, it is more robust to outliers and better able to identify clusters with non-spherical shapes and widely varying sizes.

Drawbacks of traditional algorithms

The popular K-means clustering algorithm minimizes the sum of squared errors criterion:

E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - m_i)^2,

where m_i is the mean (centroid) of the points in cluster C_i.

Given large differences in the sizes or geometries of different clusters, the square-error method can split the large clusters in order to minimize the squared error, which is not always correct. Hierarchical clustering algorithms suffer from related problems, since none of the standard inter-cluster distance measures (d_min, d_mean) works well for all cluster shapes. Their running time is also high when n is large.
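For concreteness, this criterion can be computed as follows (a minimal sketch; the function name and NumPy array layout are illustrative assumptions, not part of CURE):

    import numpy as np

    def sum_of_squared_errors(points, labels, k):
        """k-means criterion: E = sum_i sum_{p in C_i} ||p - m_i||^2.

        `points` is an (n, d) array and `labels` an (n,) integer array
        giving each point's cluster (illustrative layout)."""
        total = 0.0
        for i in range(k):
            cluster = points[labels == i]    # points of cluster C_i
            if len(cluster) == 0:
                continue
            centroid = cluster.mean(axis=0)  # m_i
            total += ((cluster - centroid) ** 2).sum()
        return total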

The problem with the BIRCH algorithm is that once the clusters are generated after step 3, it uses the centroids of the clusters and assigns each data point to the cluster with the closest centroid.[citation needed] Using only the centroid to redistribute the data is problematic when clusters do not have uniform sizes and shapes.

CURE clustering algorithm

To avoid the problems with non-uniformly sized or shaped clusters, CURE employs a hierarchical clustering algorithm that adopts a middle ground between the centroid-based and all-points extremes. In CURE, a constant number c of well-scattered points of a cluster are chosen, and they are shrunk towards the centroid of the cluster by a fraction α. The scattered points after shrinking are used as representatives of the cluster. The clusters with the closest pair of representatives are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes it less sensitive to outliers.
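A minimal sketch of this representative-selection step, assuming a farthest-point heuristic for choosing the well-scattered points (the helper name and default parameters are illustrative):

    import numpy as np

    def select_representatives(cluster_points, c=10, alpha=0.2):
        """Pick c well-scattered points of a cluster, then shrink them
        towards the cluster centroid by the fraction alpha."""
        centroid = cluster_points.mean(axis=0)
        # Start from the point farthest from the centroid, then repeatedly
        # add the point farthest from those already chosen.
        first = np.argmax(np.linalg.norm(cluster_points - centroid, axis=1))
        scattered = [cluster_points[first]]
        while len(scattered) < min(c, len(cluster_points)):
            dists = np.min(
                [np.linalg.norm(cluster_points - s, axis=1) for s in scattered],
                axis=0)
            scattered.append(cluster_points[np.argmax(dists)])
        # Shrink each scattered point towards the centroid by alpha.
        return [s + alpha * (centroid - s) for s in scattered]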

The running time is O(n² log n), making it rather expensive, and the space complexity is O(n).

The algorithm cannot be applied directly to large databases because of this high runtime complexity. The following enhancements address the issue:

  • Random sampling: random sampling supports large data sets. Generally the random sample fits in main memory. Random sampling involves a trade-off between accuracy and efficiency.
  • Partitioning: the basic idea is to partition the sample space into p partitions, each containing n/p elements. A first pass partially clusters each partition until the number of clusters in it reduces to n/(pq) for some constant q ≥ 1; a second clustering pass then merges the resulting n/q partial clusters. For the second pass, only the representative points need to be stored, since the merge procedure requires only the representative points of the previous clusters to compute the representative points of the merged cluster. Partitioning the input reduces the execution time.
  • Labeling data on disk: given only the representative points of the k clusters, the remaining data points must also be assigned to clusters. For this, a fraction of randomly selected representative points of each of the k clusters is chosen, and each remaining data point is assigned to the cluster containing the representative point closest to it (see the sketch after this list).
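A minimal sketch of the labeling step, assuming each cluster is described by an array of its representative points (the names and sampling scheme are illustrative):

    import numpy as np

    def label_points(points, representatives, fraction=0.5, seed=0):
        """Assign each point to the cluster whose nearest sampled
        representative is closest; `representatives` is a list of
        (m_j, d) arrays, one per cluster (illustrative layout)."""
        rng = np.random.default_rng(seed)
        # Randomly sample a fraction of each cluster's representatives.
        sampled = [reps[rng.choice(len(reps),
                                   size=max(1, int(fraction * len(reps))),
                                   replace=False)]
                   for reps in representatives]
        labels = np.empty(len(points), dtype=int)
        for idx, p in enumerate(points):
            # Distance to the closest sampled representative of each cluster.
            dists = [np.linalg.norm(reps - p, axis=1).min() for reps in sampled]
            labels[idx] = int(np.argmin(dists))
        return labels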

Pseudocode

CURE (no. of points, k)

Input: a set of points S

Output: k clusters

  • For every cluster u (initially, each input point is its own cluster), u.mean and u.rep store the mean of the points in the cluster and a set of c representative points of the cluster (initially c = 1, since each cluster contains a single data point), and u.closest stores the cluster closest to u.
  • Insert all input points into a k-d tree T.
  • Treat each input point as a separate cluster, compute u.closest for each u, and then insert each cluster into the heap Q (clusters are ordered by increasing distance between u and u.closest).
  • While size(Q) > k:
      • Remove the top element of Q (say u) and merge it with its closest cluster u.closest (say v), computing the new representative points for the merged cluster w.
      • Remove u and v from T and Q.
      • For all clusters x in Q, update x.closest and relocate x in Q.
      • Insert w into Q.
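A runnable sketch of this loop. For brevity it substitutes a brute-force closest-pair search for the k-d tree and heap of the pseudocode, so it does not meet the O(n² log n) bound; the structure is otherwise the same (merge the two clusters with the closest representatives until k remain):

    import itertools
    import numpy as np

    def cure(points, k, c=5, alpha=0.2):
        pts_all = np.asarray(points, dtype=float)
        # Each cluster stores its member indices, mean, and representatives.
        clusters = [{"members": [i], "mean": pts_all[i], "rep": [pts_all[i]]}
                    for i in range(len(pts_all))]

        def rep_distance(u, v):
            # Cluster distance = closest pair of representative points.
            return min(np.linalg.norm(ru - rv)
                       for ru in u["rep"] for rv in v["rep"])

        while len(clusters) > k:
            # Find the pair of clusters with the closest representatives.
            (i, u), (j, v) = min(
                itertools.combinations(enumerate(clusters), 2),
                key=lambda pair: rep_distance(pair[0][1], pair[1][1]))
            members = u["members"] + v["members"]
            pts = pts_all[members]
            mean = pts.mean(axis=0)
            # Choose up to c well-scattered points (farthest-point heuristic)
            # and shrink them towards the mean by alpha.
            scattered = [pts[np.argmax(np.linalg.norm(pts - mean, axis=1))]]
            while len(scattered) < min(c, len(pts)):
                d = np.min([np.linalg.norm(pts - s, axis=1) for s in scattered],
                           axis=0)
                scattered.append(pts[np.argmax(d)])
            w = {"members": members, "mean": mean,
                 "rep": [s + alpha * (mean - s) for s in scattered]}
            clusters = [x for idx, x in enumerate(clusters)
                        if idx not in (i, j)]
            clusters.append(w)
        return [cl["members"] for cl in clusters]

For example, cure(np.random.rand(60, 2), k=3) returns three lists of point indices, one per cluster.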

Availability

  • pyclustering, an open-source library, includes a Python and C++ implementation of the CURE algorithm.
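A usage sketch based on pyclustering's documented interface (argument and method names are taken from its documentation and may differ across versions):

    from pyclustering.cluster.cure import cure

    data = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]]
    instance = cure(data, number_cluster=2, number_represent_points=5,
                    compression=0.5)
    instance.process()
    clusters = instance.get_clusters()          # point indices per cluster
    representors = instance.get_representors()  # representative points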
