dbscan#

sklearn.cluster.dbscan(X,eps=0.5,*,min_samples=5,metric='minkowski',metric_params=None,algorithm='auto',leaf_size=30,p=2,sample_weight=None,n_jobs=None)[source]#

Perform DBSCAN clustering from vector array or distance matrix.

Read more in theUser Guide.

Parameters:
X{array-like, sparse (CSR) matrix} of shape (n_samples, n_features) or (n_samples, n_samples)

A feature array, or array of distances between samples ifmetric='precomputed'.

epsfloat, default=0.5

The maximum distance between two samples for one to be consideredas in the neighborhood of the other. This is not a maximum boundon the distances of points within a cluster. This is the mostimportant DBSCAN parameter to choose appropriately for your data setand distance function.

min_samplesint, default=5

The number of samples (or total weight) in a neighborhood for a pointto be considered as a core point. This includes the point itself.

metricstr or callable, default=’minkowski’

The metric to use when calculating distance between instances in afeature array. If metric is a string or callable, it must be one ofthe options allowed bysklearn.metrics.pairwise_distances forits metric parameter.If metric is “precomputed”, X is assumed to be a distance matrix andmust be square during fit.X may be asparse graph,in which case only “nonzero” elements may be considered neighbors.

metric_paramsdict, default=None

Additional keyword arguments for the metric function.

Added in version 0.19.

algorithm{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’

The algorithm to be used by the NearestNeighbors moduleto compute pointwise distances and find nearest neighbors.See NearestNeighbors module documentation for details.

leaf_sizeint, default=30

Leaf size passed to BallTree or cKDTree. This can affect the speedof the construction and query, as well as the memory requiredto store the tree. The optimal value dependson the nature of the problem.

pfloat, default=2

The power of the Minkowski metric to be used to calculate distancebetween points.

sample_weightarray-like of shape (n_samples,), default=None

Weight of each sample, such that a sample with a weight of at leastmin_samples is by itself a core sample; a sample with negativeweight may inhibit its eps-neighbor from being core.Note that weights are absolute, and default to 1.

n_jobsint, default=None

The number of parallel jobs to run for neighbors search.None means1 unless in ajoblib.parallel_backend context.-1 meansusing all processors. SeeGlossary for more details.If precomputed distance are used, parallel execution is not availableand thus n_jobs will have no effect.

Returns:
core_samplesndarray of shape (n_core_samples,)

Indices of core samples.

labelsndarray of shape (n_samples,)

Cluster labels for each point. Noisy samples are given the label -1.

See also

DBSCAN

An estimator interface for this clustering algorithm.

OPTICS

A similar estimator interface clustering at multiple values of eps. Our implementation is optimized for memory usage.

Notes

For an example, seeDemo of DBSCAN clustering algorithm.

This implementation bulk-computes all neighborhood queries, which increasesthe memory complexity to O(n.d) where d is the average number of neighbors,while original DBSCAN had memory complexity O(n). It may attract a highermemory complexity when querying these nearest neighborhoods, dependingon thealgorithm.

One way to avoid the query complexity is to pre-compute sparseneighborhoods in chunks usingNearestNeighbors.radius_neighbors_graph withmode='distance', then usingmetric='precomputed' here.

Another way to reduce memory and computation time is to remove(near-)duplicate points and usesample_weight instead.

OPTICS provides a similar clustering with lowermemory usage.

References

Ester, M., H. P. Kriegel, J. Sander, and X. Xu,“A Density-BasedAlgorithm for Discovering Clusters in Large Spatial Databases with Noise”.In: Proceedings of the 2nd International Conference on Knowledge Discoveryand Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).“DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.”ACM Transactions on Database Systems (TODS), 42(3), 19.

Examples

>>>fromsklearn.clusterimportdbscan>>>X=[[1,2],[2,2],[2,3],[8,7],[8,8],[25,80]]>>>core_samples,labels=dbscan(X,eps=3,min_samples=2)>>>core_samplesarray([0, 1, 2, 3, 4])>>>labelsarray([ 0,  0,  0,  1,  1, -1])
On this page

This Page