
OPTICS algorithm

From Wikipedia, the free encyclopedia

Algorithm for finding density based clusters in spatial data


Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based^[1] clusters in spatial data. It was presented in 1999 by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander.^[2] Its basic idea is similar to DBSCAN,^[3] but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. To do so, the points of the database are (linearly) ordered such that spatially closest points become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that must be accepted for a cluster so that both points belong to the same cluster. This is represented as a dendrogram.

Basic idea


Like DBSCAN, OPTICS requires two parameters: ε, which describes the maximum distance (radius) to consider, and MinPts, describing the number of points required to form a cluster. A point p is a core point if at least MinPts points are found within its ε-neighborhood $N_{\varepsilon }(p)$ (including point p itself). In contrast to DBSCAN, OPTICS also considers points that are part of a more densely packed cluster, so each point is assigned a core distance that describes the distance to the MinPts-th closest point:

$$\text{core-dist}_{\varepsilon ,\mathit{MinPts}}(p)=\begin{cases}\text{UNDEFINED}&\text{if }|N_{\varepsilon }(p)|<\mathit{MinPts}\\ \mathit{MinPts}\text{-th smallest distance in }N_{\varepsilon }(p)&\text{otherwise}\end{cases}$$
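The definition can be illustrated with a minimal Python sketch. It assumes Euclidean distance and a brute-force scan instead of a spatial index; the function name `core_distance`, the use of `None` for UNDEFINED, and the sample points are illustrative choices, not part of the original algorithm description.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Core distance of points[p], or None when p is not a core point."""
    # Distances from p to all points inside its eps-neighborhood.
    # The point itself is included (distance 0), as in the definition.
    within = sorted(
        d for d in (math.dist(points[p], q) for q in points) if d <= eps
    )
    if len(within) < min_pts:
        return None  # UNDEFINED: fewer than MinPts points within eps
    return within[min_pts - 1]  # MinPts-th smallest distance

# Hypothetical sample data: three nearby points and one far away.
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
print(core_distance(points, 0, eps=3.0, min_pts=3))  # 2.0
print(core_distance(points, 3, eps=3.0, min_pts=3))  # None
```

With MinPts = 3, the point at the origin needs a radius of 2 to cover itself and two neighbors, while the isolated point has no sufficiently dense neighborhood within ε = 3.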

The reachability-distance of another point o from a point p is either the distance between o and p, or the core distance of p, whichever is bigger:

$$\text{reachability-dist}_{\varepsilon ,\mathit{MinPts}}(o,p)=\begin{cases}\text{UNDEFINED}&\text{if }|N_{\varepsilon }(p)|<\mathit{MinPts}\\ \max(\text{core-dist}_{\varepsilon ,\mathit{MinPts}}(p),\text{dist}(p,o))&\text{otherwise}\end{cases}$$

If p and o are nearest neighbors, this is the $\varepsilon '<\varepsilon$ we need to assume to have p and o belong to the same cluster.
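As a hedged illustration of the reachability-distance, the following Python sketch evaluates the formula directly. It assumes Euclidean distance and brute-force neighborhood computation; the helper names and the sample points are hypothetical, and `None` stands in for UNDEFINED.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Core distance of points[p], or None when p is not a core point."""
    within = sorted(
        d for d in (math.dist(points[p], q) for q in points) if d <= eps
    )
    return within[min_pts - 1] if len(within) >= min_pts else None

def reachability_distance(points, o, p, eps, min_pts):
    """Reachability-distance of points[o] from points[p], or None."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None  # UNDEFINED: p is not a core point
    # The larger of p's core distance and the actual distance p-o.
    return max(cd, math.dist(points[p], points[o]))

# Hypothetical sample data: three nearby points and one far away.
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
print(reachability_distance(points, 1, 0, eps=3.0, min_pts=3))  # 2.0
print(reachability_distance(points, 0, 3, eps=3.0, min_pts=3))  # None
```

In the first call the two points are only 1 apart, but the core distance of the origin (2) dominates, so the reachability-distance is 2. The second call is undefined because the isolated point is not a core point.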

Both core-distance and reachability-distance are undefined if no sufficiently dense cluster (w.r.t. ε) is available. Given a sufficiently large ε, this never happens, but then every ε-neighborhood query returns the entire database, resulting in $O(n^{2})$ runtime. Hence, the ε parameter is required to cut off the density of clusters that are no longer interesting, and to speed up the algorithm.

The parameter ε is, strictly speaking, not necessary. It can simply be set to the maximum possible value. When a spatial index is available, however, it does play a practical role with regards to complexity. OPTICS abstracts from DBSCAN by removing this parameter, at least to the extent of only having to give the maximum value.

Pseudocode


The basic approach of OPTICS is similar to DBSCAN, but instead of maintaining known, but so far unprocessed cluster members in a set, they are maintained in a priority queue (e.g. using an indexed heap).

function OPTICS(DB, ε, MinPts) is
    for each point p of DB do
        p.reachability-distance = UNDEFINED
    for each unprocessed point p of DB do
        N = getNeighbors(p, ε)
        mark p as processed
        output p to the ordered list
        if core-distance(p, ε, MinPts) != UNDEFINED then
            Seeds = empty priority queue
            update(N, p, Seeds, ε, MinPts)
            for each next q in Seeds do
                N' = getNeighbors(q, ε)
                mark q as processed
                output q to the ordered list
                if core-distance(q, ε, MinPts) != UNDEFINED then
                    update(N', q, Seeds, ε, MinPts)

In update(), the priority queue Seeds is updated with the $\varepsilon$-neighborhood of $p$ and $q$, respectively:

function update(N, p, Seeds, ε, MinPts) is
    coredist = core-distance(p, ε, MinPts)
    for each o in N
        if o is not processed then
            new-reach-dist = max(coredist, dist(p,o))
            if o.reachability-distance == UNDEFINED then // o is not in Seeds
                o.reachability-distance = new-reach-dist
                Seeds.insert(o, new-reach-dist)
            else               // o in Seeds, check for improvement
                if new-reach-dist < o.reachability-distance then
                    o.reachability-distance = new-reach-dist
                    Seeds.move-up(o, new-reach-dist)

OPTICS hence outputs the points in a particular ordering, annotated with their smallest reachability distance (in the original algorithm, the core distance is also exported, but this is not required for further processing).
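The pseudocode can be turned into a short runnable Python sketch under a few assumptions that are not part of the original description: Euclidean distance, brute-force neighborhood queries (so quadratic runtime), and Python's `heapq` as the priority queue. Because `heapq` has no move-up operation, the sketch pushes a new entry whenever a reachability distance improves and skips outdated entries when they are popped. The function name `optics` and the sample points are illustrative.

```python
import heapq
import math

def optics(points, eps, min_pts):
    """Return (ordering, reachability) for a list of coordinate tuples."""
    n = len(points)
    reach = [None] * n          # None plays the role of UNDEFINED
    processed = [False] * n
    order = []

    def neighbors(i):
        # Brute-force eps-neighborhood query (includes i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    def core_dist(i, nbrs):
        if len(nbrs) < min_pts:
            return None
        return sorted(math.dist(points[i], points[j]) for j in nbrs)[min_pts - 1]

    def update(nbrs, i, seeds, cd):
        for o in nbrs:
            if processed[o]:
                continue
            new_reach = max(cd, math.dist(points[i], points[o]))
            if reach[o] is None or new_reach < reach[o]:
                reach[o] = new_reach
                # Stands in for insert/move-up; stale entries are skipped later.
                heapq.heappush(seeds, (new_reach, o))

    for p in range(n):
        if processed[p]:
            continue
        nbrs = neighbors(p)
        processed[p] = True
        order.append(p)
        cd = core_dist(p, nbrs)
        if cd is None:
            continue
        seeds = []
        update(nbrs, p, seeds, cd)
        while seeds:
            _, q = heapq.heappop(seeds)
            if processed[q]:
                continue  # outdated duplicate of an already processed point
            nbrs_q = neighbors(q)
            processed[q] = True
            order.append(q)
            cd_q = core_dist(nbrs=nbrs_q, i=q)
            if cd_q is not None:
                update(nbrs_q, q, seeds, cd_q)
    return order, reach

# Hypothetical data: two groups of three points on a line.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
order, reach = optics(points, eps=20.0, min_pts=2)
print(order)  # [0, 1, 2, 3, 4, 5]
print(reach)  # [None, 1.0, 1.0, 8.0, 1.0, 1.0]
```

In this example the jump to a reachability of 8.0 at the fourth point marks the gap between the two groups, while points inside each group have a reachability of 1.0.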

Extracting the clusters


Using a reachability-plot (a special kind of dendrogram), the hierarchical structure of the clusters can be obtained easily. It is a 2D plot, with the ordering of the points as processed by OPTICS on the x-axis and the reachability distance on the y-axis. Since points belonging to a cluster have a low reachability distance to their nearest neighbor, the clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster.

The image above illustrates this concept. In its upper left area, a synthetic example data set is shown. The upper right part visualizes the spanning tree produced by OPTICS, and the lower part shows the reachability plot as computed by OPTICS. Colors in this plot are labels, and not computed by the algorithm; but it is clearly visible how the valleys in the plot correspond to the clusters in the data set above. The yellow points in this image are considered noise, and no valley is found in their reachability plot. They are usually not assigned to clusters, except the omnipresent "all data" cluster in a hierarchical result.

Extracting clusters from this plot can be done manually by selecting ranges on the x-axis after visual inspection, by selecting a threshold on the y-axis (the result is then similar to a DBSCAN clustering result with the same $\varepsilon$ and MinPts parameters; here a value of 0.1 may yield good results), or by different algorithms that try to detect the valleys by steepness, knee detection, or local maxima. A range of the plot beginning with a steep descent and ending with a steep ascent is considered a valley, and corresponds to a contiguous area of high density. Additional care must be taken with the last points in a valley to assign them to the inner or outer cluster; this can be achieved by considering the predecessor.^[4] Clusterings obtained this way are usually hierarchical, and cannot be achieved by a single DBSCAN run.
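The threshold-based extraction can be sketched in Python as follows. This is a simplified illustration rather than a reference implementation: it assumes that the cluster ordering, reachability distances and core distances are already available as lists indexed by point, with `None` for UNDEFINED. The function name `extract_by_threshold` and the sample values are hypothetical.

```python
def extract_by_threshold(order, reach, core, threshold):
    """Assign DBSCAN-like labels by cutting the reachability plot at threshold.

    Returns a list of cluster labels per point; -1 marks noise.
    """
    labels = [-1] * len(order)
    cluster = -1
    for p in order:  # walk the points in OPTICS output order
        if reach[p] is None or reach[p] > threshold:
            # p is not density-reachable from the preceding points at this
            # threshold; it starts a new cluster only if it is itself a
            # core point at this threshold, otherwise it stays noise.
            if core[p] is not None and core[p] <= threshold:
                cluster += 1
                labels[p] = cluster
        else:
            labels[p] = cluster  # still inside the current valley
    return labels

# Hypothetical OPTICS output for seven points (the last one isolated).
order = [0, 1, 2, 3, 4, 5, 6]
reach = [None, 1.0, 1.0, 8.0, 1.0, 1.0, 30.0]
core = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, None]

print(extract_by_threshold(order, reach, core, threshold=2.0))
# [0, 0, 0, 1, 1, 1, -1]
print(extract_by_threshold(order, reach, core, threshold=10.0))
# [0, 0, 0, 0, 0, 0, -1]
```

Lowering the threshold splits the data into two denser clusters, while a higher threshold merges them, which reflects the hierarchical structure visible in the reachability plot.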

Complexity


Like DBSCAN, OPTICS processes each point once, and performs one $\varepsilon$-neighborhood query during this processing. Given a spatial index that grants a neighborhood query in $O(\log n)$ runtime, an overall runtime of $O(n\cdot \log n)$ is obtained. The worst case however is $O(n^{2})$, as with DBSCAN. The authors of the original OPTICS paper report an actual constant slowdown factor of 1.6 compared to DBSCAN. Note that the value of $\varepsilon$ might heavily influence the cost of the algorithm, since a value too large might raise the cost of a neighborhood query to linear complexity.

In particular, choosing $\varepsilon >\max _{x,y}d(x,y)$ (larger than the maximum distance in the data set) is possible, but leads to quadratic complexity, since every neighborhood query returns the full data set. Even when no spatial index is available, this comes at additional cost in managing the heap. Therefore, $\varepsilon$ should be chosen appropriately for the data set.

Extensions


OPTICS-OF^[5] is an outlier detection algorithm based on OPTICS. The main use is the extraction of outliers from an existing run of OPTICS at low cost compared to using a different outlier detection method. The better known version LOF is based on the same concepts.

DeLi-Clu^[6] (Density-Link-Clustering) combines ideas from single-linkage clustering and OPTICS, eliminating the $\varepsilon$ parameter and offering performance improvements over OPTICS.

HiSC^[7] is a hierarchical subspace clustering (axis-parallel) method based on OPTICS.

HiCO^[8] is a hierarchical correlation clustering algorithm based on OPTICS.

DiSH^[9] is an improvement over HiSC that can find more complex hierarchies.

FOPTICS^[10] is a faster implementation using random projections.

HDBSCAN*^[11] is based on a refinement of DBSCAN, excluding border-points from the clusters and thus following more strictly the basic definition of density-levels by Hartigan.^[12]

Availability


Java implementations of OPTICS, OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH are available in the ELKI data mining framework (with index acceleration for several distance functions, and with automatic cluster extraction using the ξ extraction method). Other Java implementations include the Weka extension (no support for ξ cluster extraction).

The R package "dbscan" includes a C++ implementation of OPTICS (with both traditional dbscan-like and ξ cluster extraction) using a k-d tree for index acceleration for Euclidean distance only.

Python implementations of OPTICS are available in the PyClustering library and in scikit-learn. HDBSCAN* is available in the hdbscan library.

References


[1] Kriegel, Hans-Peter; Kröger, Peer; Sander, Jörg; Zimek, Arthur (May 2011). "Density-based clustering". Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 1 (3): 231–240. doi:10.1002/widm.30. S2CID 36920706.
[2] Ankerst, Mihael; Breunig, Markus M.; Kriegel, Hans-Peter; Sander, Jörg (1999). "OPTICS: Ordering points to identify the clustering structure". ACM SIGMOD Record. 28 (2): 49–60. doi:10.1145/304181.304187.
[3] Martin Ester; Hans-Peter Kriegel; Jörg Sander; Xiaowei Xu (1996). Evangelos Simoudis; Jiawei Han; Usama M. Fayyad (eds.). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. CiteSeerX 10.1.1.71.1980. ISBN 1-57735-004-9.
[4] Schubert, Erich; Gertz, Michael (2018-08-22). Improving the Cluster Structure Extracted from OPTICS Plots (PDF). Lernen, Wissen, Daten, Analysen (LWDA 2018). Vol. CEUR-WS 2191. pp. 318–329 – via CEUR-WS.
[5] Markus M. Breunig; Hans-Peter Kriegel; Raymond T. Ng; Jörg Sander (1999). "OPTICS-OF: Identifying Local Outliers". Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. Vol. 1704. Springer-Verlag. pp. 262–270. doi:10.1007/b72280. ISBN 978-3-540-66490-1. S2CID 27352458.
[6] Achtert, Elke; Böhm, Christian; Kröger, Peer (2006). "DeLi-Clu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking". In Ng, Wee Keong; Kitsuregawa, Masaru; Li, Jianzhong; Chang, Kuiyu (eds.). Advances in Knowledge Discovery and Data Mining, 10th Pacific-Asia Conference, PAKDD 2006, Singapore, April 9-12, 2006, Proceedings. Lecture Notes in Computer Science. Vol. 3918. Springer. pp. 119–128. doi:10.1007/11731139_16. ISBN 978-3-540-33206-0.
[7] Achtert, Elke; Böhm, Christian; Kriegel, Hans-Peter; Kröger, Peer; Müller-Gorman, Ina; Zimek, Arthur (2006). "Finding Hierarchies of Subspace Clusters". In Fürnkranz, Johannes; Scheffer, Tobias; Spiliopoulou, Myra (eds.). Knowledge Discovery in Databases: PKDD 2006, 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, September 18-22, 2006, Proceedings. Lecture Notes in Computer Science. Vol. 4213. Springer. pp. 446–453. doi:10.1007/11871637_42. ISBN 978-3-540-45374-1.
[8] Achtert, E.; Böhm, C.; Kröger, P.; Zimek, A. (2006). "Mining Hierarchies of Correlation Clusters". 18th International Conference on Scientific and Statistical Database Management (SSDBM'06). pp. 119–128. CiteSeerX 10.1.1.707.7872. doi:10.1109/SSDBM.2006.35. ISBN 978-0-7695-2590-7. S2CID 2679909.
[9] Achtert, Elke; Böhm, Christian; Kriegel, Hans-Peter; Kröger, Peer; Müller-Gorman, Ina; Zimek, Arthur (2007). "Detection and Visualization of Subspace Cluster Hierarchies". In Ramamohanarao, Kotagiri; Krishna, P. Radha; Mohania, Mukesh K.; Nantajeewarawat, Ekawit (eds.). Advances in Databases: Concepts, Systems and Applications, 12th International Conference on Database Systems for Advanced Applications, DASFAA 2007, Bangkok, Thailand, April 9-12, 2007, Proceedings. Lecture Notes in Computer Science. Vol. 4443. Springer. pp. 152–163. doi:10.1007/978-3-540-71703-4_15. ISBN 978-3-540-71702-7.
[10] Schneider, Johannes; Vlachos, Michail (2013). "Fast parameterless density-based clustering via random projections". Proceedings of the 22nd ACM international conference on Information & Knowledge Management. pp. 861–866. doi:10.1145/2505515.2505590. ISBN 978-1-4503-2263-8.
[11] Campello, Ricardo J. G. B.; Moulavi, Davoud; Zimek, Arthur; Sander, Jörg (22 July 2015). "Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection". ACM Transactions on Knowledge Discovery from Data. 10 (1): 1–51. doi:10.1145/2733381. S2CID 2887636.
[12] J.A. Hartigan (1975). Clustering algorithms. John Wiley & Sons.
