DBSCAN

class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)

Perform DBSCAN clustering from vector array or distance matrix.

DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.

This implementation has a worst-case memory complexity of $O(n^2)$, which can occur when the eps param is large and min_samples is low, while the original DBSCAN only uses linear memory. For further details, see the Notes below.

See also

OPTICS: A similar clustering at multiple values of eps. Our implementation is optimized for memory usage.

Notes

This implementation bulk-computes all neighborhood queries, which increases the memory complexity to $O(n \cdot d)$, where $d$ is the average number of neighbors, while the original DBSCAN had memory complexity $O(n)$. It may incur a higher memory complexity when querying these nearest neighborhoods, depending on the algorithm.

One way to avoid the query complexity is to pre-compute sparse neighborhoods in chunks using NearestNeighbors.radius_neighbors_graph with mode='distance', then using metric='precomputed' here.
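The pre-computation described above can be sketched as follows. This is a minimal illustration, not the library's own recipe: the toy data, the `chunk_size` value, and the variable names are assumptions chosen for the example, and the chunks are stacked with `scipy.sparse.vstack`.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
eps = 3.0
chunk_size = 2  # hypothetical value; pick it to fit the available memory

# Index the data once, then query the eps-neighborhoods chunk by chunk.
nn = NearestNeighbors(radius=eps).fit(X)
rows = [
    # Each call returns a sparse (chunk, n_samples) matrix of distances <= eps.
    nn.radius_neighbors_graph(X[start:start + chunk_size], mode="distance")
    for start in range(0, X.shape[0], chunk_size)
]
graph = sparse.vstack(rows, format="csr")

# Cluster on the sparse distance graph instead of the raw features.
labels_precomputed = (
    DBSCAN(eps=eps, min_samples=2, metric="precomputed").fit(graph).labels_
)
labels_direct = DBSCAN(eps=eps, min_samples=2).fit(X).labels_
print(labels_precomputed, labels_direct)
```

On this toy data both calls should yield the same labels. The radius used to build the graph must be at least as large as the eps passed to DBSCAN, since pairs missing from the sparse graph are treated as not being neighbors.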

Another way to reduce memory and computation time is to remove (near-)duplicate points and use sample_weight instead.

OPTICS provides a similar clustering with lower memory usage.
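For exact duplicates, the idea above might look like the sketch below. The data and parameter values are assumptions for illustration; merging near-duplicates would need an extra rounding or binning step that is not shown here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data with repeated rows.
X = np.array(
    [[1, 2], [1, 2], [1, 2], [2, 2], [8, 8], [8, 8], [8, 8], [25, 80]],
    dtype=float,
)

# Collapse duplicates; counts become the per-sample weights.
unique_X, inverse, counts = np.unique(
    X, axis=0, return_inverse=True, return_counts=True
)
inverse = np.asarray(inverse).reshape(-1)  # guard against 2-D inverse output

db = DBSCAN(eps=1.5, min_samples=3)
labels_unique = db.fit(unique_X, sample_weight=counts).labels_

# Map the labels of the unique rows back to the original rows.
labels = labels_unique[inverse]
labels_direct = DBSCAN(eps=1.5, min_samples=3).fit(X).labels_
print(labels, labels_direct)
```

Here the three copies of `[8, 8]` collapse into one row with weight 3, which meets `min_samples=3` on its own, so the weighted fit on four rows should reproduce the labels of the unweighted fit on all eight.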

References

Ester, M., H. P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). “DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.” ACM Transactions on Database Systems (TODS), 42(3), 19.

Examples

>>> from sklearn.cluster import DBSCAN
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [8, 7], [8, 8], [25, 80]])
>>> clustering = DBSCAN(eps=3, min_samples=2).fit(X)
>>> clustering.labels_
array([ 0,  0,  0,  1,  1, -1])
>>> clustering
DBSCAN(eps=3, min_samples=2)

For an example, see Demo of DBSCAN clustering algorithm.

For a comparison of DBSCAN with other clustering algorithms, see Comparing different clustering algorithms on toy datasets.

fit(X, y=None, sample_weight=None)

Perform DBSCAN clustering from features, or distance matrix.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples): Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
y : Ignored: Not used, present here for API consistency by convention.
sample_weight : array-like of shape (n_samples,), default=None: Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

Returns:

self : object: Returns a fitted instance of self.
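The effect of sample_weight described above can be illustrated with two tiny one-dimensional datasets. The data and parameter values below are assumptions chosen only to make the two cases visible.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Case 1: two isolated points, min_samples=5.
X_far = np.array([[0.0], [10.0]])
unweighted = DBSCAN(eps=1.0, min_samples=5).fit(X_far).labels_
# A weight of 5 on the first point reaches min_samples by itself.
heavy = DBSCAN(eps=1.0, min_samples=5).fit(
    X_far, sample_weight=[5, 1]
).labels_

# Case 2: two neighboring points, min_samples=2.
X_near = np.array([[0.0], [1.0]])
plain = DBSCAN(eps=1.5, min_samples=2).fit(X_near).labels_
# A negative weight lowers the weighted neighbor count of both points.
inhibited = DBSCAN(eps=1.5, min_samples=2).fit(
    X_near, sample_weight=[1, -1]
).labels_

print(unweighted, heavy, plain, inhibited)
```

In the first case the heavy point becomes a one-sample cluster while its unweighted counterpart is noise. In the second, the weights in each neighborhood sum to 0, so neither point is core and both are labeled as noise.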

fit_predict(X, y=None, sample_weight=None)

Compute clusters from a data or distance matrix and predict labels.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples): Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
y : Ignored: Not used, present here for API consistency by convention.
sample_weight : array-like of shape (n_samples,), default=None: Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

Returns:

labels : ndarray of shape (n_samples,): Cluster labels. Noisy samples are given the label -1.
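Because noise is reported as -1 rather than as a cluster of its own, counting clusters from the returned labels needs a small adjustment. A short sketch, reusing the data from the class example (the variable names are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)

# -1 marks noise, so exclude it when counting clusters.
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```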

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing : MetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict: Parameter names mapped to their values.
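As a brief illustration, the returned dictionary holds the constructor arguments from the class signature, including those left at their defaults:

```python
from sklearn.cluster import DBSCAN

params = DBSCAN(eps=3, min_samples=2).get_params()

# Explicitly set values and untouched defaults are both reported.
print(params["eps"], params["min_samples"], params["metric"])
```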

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → DBSCAN

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for the sample_weight parameter in fit.

Returns:

self : object: The updated object.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
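The <component>__<parameter> form can be illustrated with a small pipeline. The step names "scale" and "cluster" and the parameter values are arbitrary choices for this sketch.

```python
from sklearn.cluster import DBSCAN
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Direct use on a simple estimator.
db = DBSCAN().set_params(eps=0.3)

# Nested use: address the DBSCAN step through its step name.
pipe = Pipeline([("scale", StandardScaler()), ("cluster", DBSCAN())])
pipe.set_params(cluster__eps=0.3, cluster__min_samples=10)

print(db.eps, pipe.named_steps["cluster"])
```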

Parameters: