BIRCH

From Wikipedia, the free encyclopedia

Clustering using tree-based data aggregation

This article is about the clustering algorithm. For the tree, see Birch. For other uses, see Birch (disambiguation).

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data sets.^[1] With modifications it can also be used to accelerate k-means clustering and Gaussian mixture modeling with the expectation–maximization algorithm.^[2] An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the database.

Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively",^[1] beating DBSCAN by two months. The BIRCH algorithm received the SIGMOD 10 year test of time award in 2006.^[3]

Problem with previous methods

Previous clustering algorithms performed less effectively over very large databases and did not adequately consider the case in which a data set was too large to fit in main memory. As a result, maintaining high clustering quality while minimizing the cost of additional I/O (input/output) operations incurred considerable overhead. Furthermore, most of BIRCH's predecessors inspect all data points (or all currently existing clusters) equally for each "clustering decision" and do not perform heuristic weighting based on the distance between these data points.

Advantages with BIRCH

- It is local in that each clustering decision is made without scanning all data points and currently existing clusters.
- It exploits the observation that the data space is not usually uniformly occupied and not every data point is equally important.
- It makes full use of available memory to derive the finest possible sub-clusters while minimizing I/O costs.
- It is also an incremental method that does not require the whole data set in advance.

Algorithm

The BIRCH algorithm takes as input a set of N data points, represented as real-valued vectors, and a desired number of clusters K. It operates in four phases, the second of which is optional.

The first phase builds a clustering feature ($CF$) tree out of the data points, a height-balanced tree data structure, defined as follows:

Given a set of N d-dimensional data points, the clustering feature $CF$ of the set is defined as the triple $CF=(N,\overrightarrow{LS},SS)$, where
- $\overrightarrow{LS}=\sum_{i=1}^{N}\overrightarrow{X_i}$ is the linear sum.
- $SS=\sum_{i=1}^{N}(\overrightarrow{X_i})^{2}$ is the square sum of data points.
Clustering features are organized in a CF tree, a height-balanced tree with two parameters: branching factor $B$ and threshold $T$. Each non-leaf node contains at most $B$ entries of the form $[CF_i, child_i]$, where $child_i$ is a pointer to its $i$-th child node and $CF_i$ is the clustering feature representing the associated subcluster. A leaf node contains at most $L$ entries, each of the form $[CF_i]$. It also has two pointers, prev and next, which are used to chain all leaf nodes together. The tree size depends on the parameter $T$. A node is required to fit in a page of size $P$; $B$ and $L$ are determined by $P$, so $P$ can be varied for performance tuning. The CF tree is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a subcluster.
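
As a minimal sketch of the definition above, a clustering feature can be held in a small record. The class name `CF` and its methods are illustrative choices made here, not names from the original paper or from any library. Adding two features component-wise gives the feature of the union of two disjoint point sets, which is what allows a tree node to summarize its subclusters without keeping the points themselves:

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CF:
    """Clustering feature (N, LS, SS) of a set of d-dimensional points."""
    n: int            # number of points N
    ls: List[float]   # linear sum LS, one entry per dimension
    ss: float         # scalar sum of squared norms SS

    @classmethod
    def from_points(cls, points: Sequence[Sequence[float]]) -> "CF":
        dim = len(points[0])
        ls = [sum(p[k] for p in points) for k in range(dim)]
        ss = sum(x * x for p in points for x in p)
        return cls(len(points), ls, ss)

    def __add__(self, other: "CF") -> "CF":
        # Component-wise addition summarizes the union of two disjoint sets.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)


a = CF.from_points([(1.0, 2.0), (3.0, 4.0)])
b = CF.from_points([(5.0, 6.0)])
merged = a + b
direct = CF.from_points([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)])
```

Here `merged` and `direct` describe the same three points, once built by merging two summaries and once computed from the raw data.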

In the second phase, the algorithm scans all the leaf entries in the initial $CF$ tree to rebuild a smaller $CF$ tree, while removing outliers and grouping crowded subclusters into larger ones. This phase is marked optional in the original presentation of BIRCH.

In the third phase an existing clustering algorithm is used to cluster all leaf entries. Here an agglomerative hierarchical clustering algorithm is applied directly to the subclusters represented by their $CF$ vectors. This also gives the user the flexibility to specify either the desired number of clusters or the desired diameter threshold for clusters. After this phase a set of clusters is obtained that captures the major distribution patterns in the data. However, minor and localized inaccuracies may remain, which can be handled by an optional fourth phase. In the fourth phase the centroids of the clusters produced in the third phase are used as seeds, and the data points are redistributed to their closest seeds to obtain a new set of clusters. The fourth phase also provides the option of discarding outliers: a point that is too far from its closest seed can be treated as an outlier.
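
The redistribution of points to their closest seeds described above can be sketched as follows. This is only an illustration under assumptions: the function name `redistribute`, the use of the Euclidean distance, and the `max_dist` cutoff for outliers are choices made here, not details given in the text.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def redistribute(points: Sequence[Point],
                 seeds: Sequence[Point],
                 max_dist: Optional[float] = None
                 ) -> Tuple[List[List[Point]], List[Point]]:
    """Assign each point to its closest seed; optionally set aside outliers."""
    clusters: List[List[Point]] = [[] for _ in seeds]
    outliers: List[Point] = []
    for p in points:
        dists = [math.dist(p, s) for s in seeds]
        best = min(range(len(seeds)), key=dists.__getitem__)
        if max_dist is not None and dists[best] > max_dist:
            outliers.append(p)        # too far from every seed
        else:
            clusters[best].append(p)
    return clusters, outliers


seeds = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.0, 10.5), (1.0, -1.0), (50.0, 50.0)]
clusters, outliers = redistribute(points, seeds, max_dist=5.0)
```

With these sample values the first and third points go to the first seed, the second point goes to the second seed, and the last point is set aside as an outlier.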

Calculations with the clustering features

Given only the clustering feature $CF=(N,\overrightarrow{LS},SS)$, measures such as the centroid, the radius, and the distance between two clusters can be calculated without knowledge of the underlying actual values.

Centroid: $\overrightarrow{C}=\frac{\sum_{i=1}^{N}\overrightarrow{X_i}}{N}=\frac{\overrightarrow{LS}}{N}$
Radius: $R=\sqrt{\frac{\sum_{i=1}^{N}(\overrightarrow{X_i}-\overrightarrow{C})^{2}}{N}}=\sqrt{\frac{N\cdot\overrightarrow{C}^{2}+SS-2\cdot\overrightarrow{C}\cdot\overrightarrow{LS}}{N}}=\sqrt{\frac{SS}{N}-\left(\frac{\overrightarrow{LS}}{N}\right)^{2}}$
Average linkage distance between clusters $CF_1=(N_1,\overrightarrow{LS_1},SS_1)$ and $CF_2=(N_2,\overrightarrow{LS_2},SS_2)$: $D_2=\sqrt{\frac{\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}(\overrightarrow{X_i}-\overrightarrow{Y_j})^{2}}{N_1\cdot N_2}}=\sqrt{\frac{N_1\cdot SS_2+N_2\cdot SS_1-2\cdot\overrightarrow{LS_1}\cdot\overrightarrow{LS_2}}{N_1\cdot N_2}}$

In the multidimensional case, the square of a vector in these formulas is to be read as its squared Euclidean norm, that is, the dot product of the vector with itself.

BIRCH uses the distances D0 to D4 to find the nearest leaf, then uses the radius R or the diameter D to decide whether to absorb the data point into the existing leaf or to add a new leaf.
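
The formulas above, together with the absorb-or-split decision, can be sketched in a few lines. The function names, the tuple layout `(N, LS, SS)`, and the clamping of slightly negative values before the square root are assumptions made for this illustration; the radius is used as the absorption criterion here, although the diameter could be used instead.

```python
import math
from typing import List, Sequence, Tuple

# A clustering feature as a plain tuple (N, LS, SS); LS is a list of floats.
CF = Tuple[int, List[float], float]


def make_cf(points: Sequence[Sequence[float]]) -> CF:
    dim = len(points[0])
    ls = [sum(p[k] for p in points) for k in range(dim)]
    ss = sum(x * x for p in points for x in p)
    return len(points), ls, ss


def centroid(cf: CF) -> List[float]:
    n, ls, _ = cf
    return [v / n for v in ls]


def radius(cf: CF) -> float:
    n, ls, ss = cf
    sq = ss / n - sum((v / n) ** 2 for v in ls)
    return math.sqrt(max(sq, 0.0))   # clamp: rounding can push sq below zero


def d2(a: CF, b: CF) -> float:
    (n1, ls1, ss1), (n2, ls2, ss2) = a, b
    dot = sum(x * y for x, y in zip(ls1, ls2))
    sq = (n1 * ss2 + n2 * ss1 - 2.0 * dot) / (n1 * n2)
    return math.sqrt(max(sq, 0.0))


def absorbs(cf: CF, point: Sequence[float], threshold: float) -> bool:
    """Would the entry still have radius <= threshold after adding point?"""
    n, ls, ss = cf
    grown = (n + 1,
             [v + x for v, x in zip(ls, point)],
             ss + sum(x * x for x in point))
    return radius(grown) <= threshold


entry = make_cf([(0.0, 0.0), (2.0, 0.0)])
other = make_cf([(0.0, 3.0)])
near = absorbs(entry, (1.0, 0.0), threshold=1.0)
far = absorbs(entry, (10.0, 0.0), threshold=1.0)
```

In this example the nearby point keeps the radius below the threshold and would be absorbed, while the distant point would have to start a new entry.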

Numerical issues in BIRCH clustering features

Unfortunately, there are numerical issues associated with the use of the term $SS$ in BIRCH. When computing the difference $\frac{SS}{N}-\big(\frac{\vec{LS}}{N}\big)^{2}$, or similar differences in other distances such as $D_2$, catastrophic cancellation can occur and yield poor precision; in some cases the result can even become negative, so that the square root is undefined.^[2] This can be resolved by using BETULA cluster features $CF=(N,\mu,S)$ instead, which store the count $N$, the mean $\mu$, and the sum of squared deviations $S$, based on numerically more reliable online algorithms to calculate variance. For these features, a similar additivity theorem holds. When storing a vector or a matrix for the squared deviations, the resulting BIRCH CF-tree can also be used to accelerate Gaussian mixture modeling with the expectation–maximization algorithm, besides k-means clustering and hierarchical agglomerative clustering.
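
The cancellation problem is easy to reproduce in double precision. The following sketch uses four made-up one-dimensional values with a large common offset; the data and variable names are chosen only for illustration. The difference $SS/N-(LS/N)^2$ subtracts two numbers near $10^{18}$, where neighboring floating-point values are far more than 22.5 apart, so the small true variance cannot survive the subtraction:

```python
# Four values with a large common offset; their population variance is 22.5.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]
n = len(data)

# BIRCH-style summary: linear sum and sum of squares.
ls = sum(data)
ss = sum(x * x for x in data)
naive_var = ss / n - (ls / n) ** 2      # subtracts two numbers near 1e18

# Two-pass reference: subtract the mean first, then square.
mean = ls / n
two_pass_var = sum((x - mean) ** 2 for x in data) / n

print(naive_var, two_pass_var)
```

The two-pass value is the correct 22.5, whereas the value derived from $LS$ and $SS$ is far off and may even be negative.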

Instead of storing the linear sum and the sum of squares, we can store the mean and the squared deviation from the mean in each cluster feature $CF'=(N,\mu,S)$,^[4] where

- $N$ is the node weight (number of points)
- $\mu$ is the node center vector (arithmetic mean, centroid)
- $S$ is the sum of squared deviations from the mean (either a vector, or a sum to conserve memory, depending on the application)

The main difference here is that S is computed relative to the center, instead of relative to the origin.

A single point $x$ can be cast into a cluster feature $CF_x=(1,x,0)$. In order to combine two cluster features $CF_{AB}=CF_A+CF_B$, we use

- $N_{AB}=N_A+N_B$
- $\mu_{AB}=\mu_A+\frac{N_B}{N_{AB}}(\mu_B-\mu_A)$ (incremental update of the mean)
- $S_{AB}=S_A+S_B+N_B(\mu_B-\mu_A)\circ(\mu_B-\mu_{AB})$ in vector form using the element-wise product, or
- $S_{AB}=S_A+S_B+N_B(\mu_B-\mu_A)^{T}(\mu_B-\mu_{AB})$ to update a scalar sum of squared deviations

These updates are numerically more reliable (cf. online computation of the variance) because they avoid the subtraction of two similar squared values. The centroid is simply the node center vector $\mu$, and can directly be used for distance computations using, e.g., the Euclidean or Manhattan distances. The radius simplifies to $R=\sqrt{\frac{1}{N}S}$ and the diameter to $D=\sqrt{\frac{2}{N-1}S}$.
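
The merge rules and the simplified radius and diameter translate directly into code. The sketch below uses the scalar form of $S$; the tuple layout and the function names are assumptions made for this illustration and are not taken from ELKI or any other implementation.

```python
import math
from typing import List, Tuple

# BETULA cluster feature (N, mu, S) with a scalar S.
Feature = Tuple[int, List[float], float]


def from_point(x: List[float]) -> Feature:
    return 1, list(x), 0.0


def merge(a: Feature, b: Feature) -> Feature:
    n_a, mu_a, s_a = a
    n_b, mu_b, s_b = b
    n_ab = n_a + n_b
    # Incremental update of the mean.
    mu_ab = [ma + (n_b / n_ab) * (mb - ma) for ma, mb in zip(mu_a, mu_b)]
    # Scalar update: S_A + S_B + N_B (mu_B - mu_A)^T (mu_B - mu_AB).
    cross = sum((mb - ma) * (mb - mab)
                for ma, mb, mab in zip(mu_a, mu_b, mu_ab))
    return n_ab, mu_ab, s_a + s_b + n_b * cross


def radius(f: Feature) -> float:
    n, _, s = f
    return math.sqrt(s / n)


def diameter(f: Feature) -> float:
    n, _, s = f
    return math.sqrt(2.0 * s / (n - 1))   # needs at least two points


# Values with a large offset, inserted one at a time.
feature = from_point([1e9 + 4.0])
for value in (1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0):
    feature = merge(feature, from_point([value]))
```

Even though the values are close to $10^9$, the deviations are accumulated relative to the running mean, so the sum of squared deviations of these four points (90) is recovered accurately.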

We can now compute the different distances D0 to D4 used in the BIRCH algorithm as:^[4]

- Euclidean distance $D_0=\|\mu_A-\mu_B\|$ and Manhattan distance $D_1=\|\mu_A-\mu_B\|_1$ are computed using the CF centers $\mu$
- Inter-cluster distance $D_2=\sqrt{\frac{1}{N_A}S_A+\frac{1}{N_B}S_B+\big\|\mu_A-\mu_B\big\|^{2}}$
- Intra-cluster distance $D_3=\sqrt{\frac{2}{N_{AB}(N_{AB}-1)}\left(N_{AB}(S_A+S_B)+N_A N_B\big\|\mu_A-\mu_B\big\|^{2}\right)}$
- Variance-increase distance $D_4=\sqrt{\frac{N_A N_B}{N_{AB}}\big\|\mu_A-\mu_B\big\|^{2}}$

These distances can also be used to initialize the distance matrix for hierarchical clustering, depending on the chosen linkage. For accurate hierarchical clustering and k-means clustering, we also need to use the node weight $N$.
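
As an illustration, the five distances can be computed from two cluster features with a scalar $S$ as follows. The tuple layout `(N, mu, S)` and the function names are assumptions of this sketch.

```python
import math
from typing import List, Tuple

Feature = Tuple[int, List[float], float]   # (N, mu, S) with scalar S


def sq_dist(a: Feature, b: Feature) -> float:
    """Squared Euclidean distance between the two centers."""
    return sum((x - y) ** 2 for x, y in zip(a[1], b[1]))


def d0(a: Feature, b: Feature) -> float:
    return math.sqrt(sq_dist(a, b))


def d1(a: Feature, b: Feature) -> float:
    return sum(abs(x - y) for x, y in zip(a[1], b[1]))


def d2(a: Feature, b: Feature) -> float:
    return math.sqrt(a[2] / a[0] + b[2] / b[0] + sq_dist(a, b))


def d3(a: Feature, b: Feature) -> float:
    n_ab = a[0] + b[0]
    inner = n_ab * (a[2] + b[2]) + a[0] * b[0] * sq_dist(a, b)
    return math.sqrt(2.0 * inner / (n_ab * (n_ab - 1)))


def d4(a: Feature, b: Feature) -> float:
    return math.sqrt(a[0] * b[0] / (a[0] + b[0]) * sq_dist(a, b))


# A summarizes the points (0, 0) and (2, 0); B is the single point (0, 3).
A: Feature = (2, [1.0, 0.0], 2.0)
B: Feature = (1, [0.0, 3.0], 0.0)
```

For this small example the results can be checked by hand: the squared pairwise distances between A's points and B's point are 9 and 13, so $D_2^2$ is their average, 11.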

Clustering Step

The CF-tree provides a compressed summary of the data set, but the leaves themselves only provide a very poor data clustering. In a second step, the leaves can be clustered using, e.g.,

- k-means clustering, where leaves are weighted by the number of points, N.
- k-means++, by sampling cluster features proportional to $S+N\min_{i}\|\mu-c_i\|^{2}$, where the $c_i$ are the previously chosen centers and $(N,\mu,S)$ is the BETULA cluster feature.
- Gaussian mixture modeling, where the variance S can also be taken into account, and, if the leaves store covariances, the covariances as well.
- Hierarchical agglomerative clustering, where the linkage can be initialized using the following equivalence of linkages to BIRCH distances:^[5]

Correspondence between HAC linkages and BIRCH distances^[5]
HAC Linkage	BIRCH distance
UPGMA	D2²
WPGMA	D0²
Ward	2 D4²
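
To illustrate the first option in the list above, the following sketch runs a few Lloyd iterations of k-means on leaf features, weighting each leaf center by its count N. The function name `weighted_kmeans`, the fixed iteration count, and the sample leaves are assumptions of this example; it is not the procedure of any particular implementation.

```python
import math
from typing import List, Sequence, Tuple

Feature = Tuple[int, List[float], float]   # (N, mu, S)


def weighted_kmeans(leaves: Sequence[Feature],
                    centers: List[List[float]],
                    iterations: int = 10) -> List[List[float]]:
    """Lloyd iterations over leaf centers, each weighted by its count N."""
    for _ in range(iterations):
        dim = len(centers[0])
        sums = [[0.0] * dim for _ in centers]
        weights = [0] * len(centers)
        for n, mu, _ in leaves:
            # Assign the whole leaf to its nearest current center.
            best = min(range(len(centers)),
                       key=lambda i: math.dist(mu, centers[i]))
            weights[best] += n
            for k in range(dim):
                sums[best][k] += n * mu[k]
        # New center = N-weighted mean of assigned leaves (kept if empty).
        centers = [[s / w for s in row] if w else old
                   for row, w, old in zip(sums, weights, centers)]
    return centers


leaves = [(3, [0.0, 0.0], 0.5), (1, [4.0, 0.0], 0.0),
          (2, [10.0, 10.0], 0.3), (2, [12.0, 10.0], 0.1)]
centers = weighted_kmeans(leaves, [[0.0, 0.0], [10.0, 10.0]])
```

Because the first leaf carries three points and the second only one, the first center moves to (1, 0) rather than to the unweighted midpoint (2, 0).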

Availability

- ELKI contains BIRCH and BETULA.
- scikit-learn contains a limited version of BIRCH, which supports only the D0 distance and static thresholds, and which uses only the centroids of the leaves in the clustering step.^[6]

References

[1] Zhang, T.; Ramakrishnan, R.; Livny, M. (1996). "BIRCH: an efficient data clustering method for very large databases". Proceedings of the 1996 ACM SIGMOD international conference on Management of data - SIGMOD '96. pp. 103–114. doi:10.1145/233269.233324.
[2] Lang, Andreas; Schubert, Erich (2020). "BETULA: Numerically Stable CF-Trees for BIRCH Clustering". Similarity Search and Applications. pp. 281–296. arXiv:2006.12881. doi:10.1007/978-3-030-60936-8_22. ISBN 978-3-030-60935-1. S2CID 219980434. Retrieved 2021-01-16.
[3] "2006 SIGMOD Test of Time Award". Archived from the original on 2010-05-23.
[4] Lang, Andreas; Schubert, Erich (2022). "BETULA: Fast clustering of large data with improved BIRCH CF-Trees". Information Systems. 108 101918. doi:10.1016/j.is.2021.101918.
[5] Schubert, Erich; Lang, Andreas (2022-12-31). "5 Cluster Analysis". Machine Learning under Resource Constraints - Fundamentals. De Gruyter. pp. 215–226. arXiv:2309.02552. doi:10.1515/9783110785944-005. ISBN 978-3-11-078594-4.
[6] as discussed in [1]
