dbscan
- sklearn.cluster.dbscan(X, eps=0.5, *, min_samples=5, metric='minkowski', metric_params=None, algorithm='auto', leaf_size=30, p=2, sample_weight=None, n_jobs=None)
Perform DBSCAN clustering from vector array or distance matrix.
Read more in the User Guide.
- Parameters:
- X : {array-like, sparse (CSR) matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.
- eps : float, default=0.5
The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
- min_samples : int, default=5
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
- metric : str or callable, default='minkowski'
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only "nonzero" elements may be considered neighbors.
- metric_params : dict, default=None
Additional keyword arguments for the metric function.
Added in version 0.19.
- algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.
- leaf_size : int, default=30
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
- p : float, default=2
The power of the Minkowski metric to be used to calculate distance between points.
- sample_weight : array-like of shape (n_samples,), default=None
Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
- n_jobs : int, default=None
The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. If precomputed distances are used, parallel execution is not available and thus n_jobs will have no effect.
- Returns:
- core_samples : ndarray of shape (n_core_samples,)
Indices of core samples.
- labels : ndarray of shape (n_samples,)
Cluster labels for each point. Noisy samples are given the label -1.
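The following is a minimal sketch, not part of the reference itself, of how eps, min_samples, sample_weight and the two return values interact. The data, the variable names (core_mask, border_mask, noise_mask, cluster_span) and the chosen parameter values are illustrative assumptions. It shows that a chain of points forms one cluster even though its end points are farther apart than eps, and that a sample whose weight reaches min_samples becomes a core sample on its own.

```python
import numpy as np
from sklearn.cluster import dbscan

# Illustrative data: six points spaced 1 apart on a line, plus one isolated point.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [50.0]])

core_samples, labels = dbscan(X, eps=1.5, min_samples=3)

# Derive core / border / noise masks from the two return values.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[core_samples] = True
noise_mask = labels == -1
border_mask = ~core_mask & ~noise_mask  # clustered, but not core

# The chain is a single cluster although its end points are 5 apart,
# which is larger than eps: eps bounds neighborhoods, not cluster diameter.
in_cluster = X[labels == 0]
cluster_span = float(in_cluster.max() - in_cluster.min())

# Give the isolated point a weight equal to min_samples: its own weight
# now satisfies the core condition, so it forms a cluster by itself.
weights = np.array([1, 1, 1, 1, 1, 1, 3])
core_w, labels_w = dbscan(X, eps=1.5, min_samples=3, sample_weight=weights)

print(core_samples, labels)
print(np.flatnonzero(border_mask), np.flatnonzero(noise_mask), cluster_span)
print(core_w, labels_w)
```

In this sketch the two end points of the chain have only two samples in their neighborhoods, so they are border points: they receive a cluster label but do not appear in core_samples.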
See also
Notes
For an example, see Demo of DBSCAN clustering algorithm.
This implementation bulk-computes all neighborhood queries, which increases the memory complexity to O(n.d) where d is the average number of neighbors, while original DBSCAN had memory complexity O(n). It may attract a higher memory complexity when querying these nearest neighborhoods, depending on the algorithm.
One way to avoid the query complexity is to pre-compute sparse neighborhoods in chunks using NearestNeighbors.radius_neighbors_graph with mode='distance', then using metric='precomputed' here.
Another way to reduce memory and computation time is to remove (near-)duplicate points and use sample_weight instead.
OPTICS provides a similar clustering with lower memory usage.
References
Ester, M., H. P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). “DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.” ACM Transactions on Database Systems (TODS), 42(3), 19.
Examples
>>> from sklearn.cluster import dbscan
>>> X = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]]
>>> core_samples, labels = dbscan(X, eps=3, min_samples=2)
>>> core_samples
array([0, 1, 2, 3, 4])
>>> labels
array([ 0,  0,  0,  1,  1, -1])
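The two memory-saving approaches mentioned in the Notes can be sketched as follows. This is an illustration under stated assumptions rather than a recommended recipe: chunk_size, X_dup and the other variable names are hypothetical, the data set is tiny, and exact duplicates stand in for the "(near-)duplicate" case. The first part builds the sparse neighborhood graph in chunks of query rows and passes it with metric='precomputed'; the second part collapses duplicate rows into sample_weight.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import dbscan
from sklearn.neighbors import NearestNeighbors

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
eps, min_samples = 3, 2

# Reference result computed directly from the feature array.
core_ref, labels_ref = dbscan(X, eps=eps, min_samples=min_samples)

# 1) Pre-compute sparse neighborhoods in chunks of query rows.
nn = NearestNeighbors(radius=eps).fit(X)
chunk_size = 2  # hypothetical value; choose it to fit available memory
blocks = [
    nn.radius_neighbors_graph(
        X[start:start + chunk_size], mode="distance", sort_results=True
    )
    for start in range(0, X.shape[0], chunk_size)
]
graph = sparse.vstack(blocks, format="csr")  # shape (n_samples, n_samples)

core_sparse, labels_sparse = dbscan(
    graph, eps=eps, min_samples=min_samples, metric="precomputed"
)

# 2) Remove exact duplicates and carry their multiplicity as sample_weight.
X_dup = np.vstack([X, X[:3]])  # the first three rows appear twice
X_unique, inverse, counts = np.unique(
    X_dup, axis=0, return_inverse=True, return_counts=True
)
inverse = np.ravel(inverse)  # guard against shape differences across NumPy versions

core_w, labels_w = dbscan(
    X_unique, eps=eps, min_samples=min_samples, sample_weight=counts
)
labels_expanded = labels_w[inverse]  # map labels back onto the original rows

# Clustering the duplicated data directly, for comparison.
_, labels_dup = dbscan(X_dup, eps=eps, min_samples=min_samples)

print(labels_ref, labels_sparse)
print(labels_dup, labels_expanded)
```

The radius used to build the graph must be at least eps, otherwise neighbors that DBSCAN needs are missing from the graph. When comparing labels between the deduplicated and the full data, note that cluster numbering depends on sample order, and np.unique sorts rows; the labels coincide here only because the example rows are already in sorted order.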
