AgglomerativeClustering

class sklearn.cluster.AgglomerativeClustering(n_clusters=2, *, metric='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', distance_threshold=None, compute_distances=False)

Agglomerative Clustering.

Recursively merges pairs of clusters of sample data, using the linkage distance.

Read more in the User Guide.

Parameters:
n_clusters : int or None, default=2

The number of clusters to find. It must be None if distance_threshold is not None.

metric : str or callable, default="euclidean"

Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix is needed as input for the fit method. If connectivity is None, linkage is “single” and metric is not “precomputed”, any valid pairwise distance metric can be assigned.

For an example of agglomerative clustering with different metrics, see Agglomerative clustering with different metrics.

Added in version 1.2.

memory : str or object with the joblib.Memory interface, default=None

Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.
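
A minimal sketch of caching, assuming a temporary directory as the cache location (the directory and the toy data are illustrative choices). With compute_full_tree=True the tree does not depend on n_clusters, so the second fit should be able to reuse the cached tree rather than rebuild it.

```python
import tempfile

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data (illustrative): three separated groups.
X = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [20, 0], [21, 0]]
)

labels_by_k = {}
with tempfile.TemporaryDirectory() as cache_dir:
    for k in (2, 3):
        model = AgglomerativeClustering(
            n_clusters=k,
            memory=cache_dir,        # path to the caching directory
            compute_full_tree=True,  # build the whole tree, then cut it at k
        )
        labels_by_k[k] = model.fit_predict(X)

print(labels_by_k[2])
print(labels_by_k[3])
```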

connectivity : array-like, sparse matrix, or callable, default=None

Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as derived from kneighbors_graph. Default is None, i.e., the hierarchical clustering algorithm is unstructured.

For an example of connectivity matrix using kneighbors_graph, see Agglomerative clustering with and without structure.
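
The two accepted forms of connectivity can be sketched as follows. The random data, the value n_neighbors=10, and the helper name knn_connectivity are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

rng = np.random.RandomState(0)
X = rng.rand(60, 2)

# Form 1: a sparse graph linking every sample to its 10 nearest neighbors.
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
from_matrix = AgglomerativeClustering(
    n_clusters=3, connectivity=connectivity
).fit(X)


# Form 2: a callable that builds the graph from the data passed to fit.
def knn_connectivity(data):
    """Hypothetical helper: 10-nearest-neighbor graph of `data`."""
    return kneighbors_graph(data, n_neighbors=10, include_self=False)


from_callable = AgglomerativeClustering(
    n_clusters=3, connectivity=knn_connectivity
).fit(X)

print(np.unique(from_matrix.labels_))
```

With a connectivity constraint, only neighboring clusters are candidates for merging, which is what makes the clustering "structured".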

compute_full_tree : 'auto' or bool, default='auto'

Stop the construction of the tree early at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. By default compute_full_tree is “auto”, which is equivalent to True when distance_threshold is not None or when n_clusters is less than the maximum of 100 and 0.02 * n_samples. Otherwise, “auto” is equivalent to False.
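
The “auto” rule described above can be written out as a small function. This is only an illustration of the documented rule; resolve_compute_full_tree is a hypothetical helper, not part of scikit-learn.

```python
def resolve_compute_full_tree(compute_full_tree, n_clusters, n_samples,
                              distance_threshold=None):
    """Hypothetical helper mirroring the documented "auto" rule."""
    if compute_full_tree != "auto":
        # An explicit True/False is used as given.
        return bool(compute_full_tree)
    if distance_threshold is not None:
        # A distance threshold always requires the full tree.
        return True
    return n_clusters < max(100, 0.02 * n_samples)


print(resolve_compute_full_tree("auto", n_clusters=2, n_samples=1_000))     # True
print(resolve_compute_full_tree("auto", n_clusters=150, n_samples=1_000))   # False
print(resolve_compute_full_tree("auto", n_clusters=150, n_samples=10_000))  # True
```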

linkage : {'ward', 'complete', 'average', 'single'}, default='ward'

Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.

  • ‘ward’ minimizes the variance of the clusters being merged.

  • ‘average’ uses the average of the distances of each observation of the two sets.

  • ‘complete’ or ‘maximum’ linkage uses the maximum of the distances between all observations of the two sets.

  • ‘single’ uses the minimum of the distances between all observations of the two sets.

Added in version 0.20: Added the ‘single’ option.

For examples comparing different linkage criteria, see Comparing different hierarchical linkage methods on toy datasets.
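
To make the distance-based criteria concrete, the sketch below computes the single, average, and complete linkage distances between two small sets by hand with NumPy. The sets A and B are made up for illustration, and ‘ward’ is left out because it is a variance criterion rather than a function of the pairwise distances alone.

```python
import numpy as np

# Two illustrative sets of observations.
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 0.0]])

# Euclidean distance between every point of A and every point of B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

single = pairwise.min()     # minimum of the distances
average = pairwise.mean()   # average of the distances
complete = pairwise.max()   # maximum of the distances

print(single, average, complete)
```

By construction the three values are ordered single <= average <= complete, which is one reason the choice of linkage changes which pair of clusters is merged next.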

distance_threshold : float, default=None

The linkage distance threshold at or above which clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True.

Added in version 0.21.
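
A small sketch of clustering by threshold instead of by count. The one-dimensional data, the threshold of 3.0, and the use of single linkage are assumptions chosen so the merge distances are easy to follow by hand (0.5, 0.5, 0.5, then 9 and 9.5).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Illustrative 1-D data: a tight triple, a tight pair, and an isolated point.
X = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [20.0]])

model = AgglomerativeClustering(
    n_clusters=None,         # must be None when distance_threshold is set
    distance_threshold=3.0,  # merges at distance >= 3.0 are not performed
    linkage="single",
).fit(X)

print(model.n_clusters_)
print(model.labels_)
```

Here the two remaining merges would happen at distances 9 and 9.5, both above the threshold, so n_clusters_ should come out as 3.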

compute_distances : bool, default=False

Computes distances between clusters even if distance_threshold is not used. This can be used to make dendrogram visualization, but introduces a computational and memory overhead.

Added in version 0.24.

For an example of dendrogram visualization, see Plot Hierarchical Clustering Dendrogram.
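
One way to prepare dendrogram input is to assemble a SciPy-style linkage matrix from children_ and distances_, adding the number of samples under each merged node. The sketch below assumes SciPy is available and uses no_plot=True so that nothing is drawn; the toy data is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
model = AgglomerativeClustering(n_clusters=2, compute_distances=True).fit(X)

# Count the original samples under each non-leaf node.
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    for child in merge:
        # A child index below n_samples is a leaf; otherwise it is an
        # earlier merge whose count has already been computed.
        counts[i] += 1 if child < n_samples else counts[child - n_samples]

# Columns: left child, right child, merge distance, sample count.
linkage_matrix = np.column_stack(
    [model.children_, model.distances_, counts]
).astype(float)

# no_plot=True returns the dendrogram layout without drawing it.
info = dendrogram(linkage_matrix, no_plot=True)
print(info["leaves"])
```

Dropping no_plot=True (with matplotlib installed) would draw the dendrogram instead of only computing its layout.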

Attributes:
n_clusters_ : int

The number of clusters found by the algorithm. If distance_threshold=None, it will be equal to the given n_clusters.

labels_ : ndarray of shape (n_samples,)

Cluster labels for each point.

n_leaves_ : int

Number of leaves in the hierarchical tree.

n_connected_components_ : int

The estimated number of connected components in the graph.

Added in version 0.21: n_connected_components_ was added to replace n_components_.

n_features_in_ : int

Number of features seen during fit.

Added in version 0.24.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

Added in version 1.0.

children_ : array-like of shape (n_samples-1, 2)

The children of each non-leaf node. Values less than n_samples correspond to leaves of the tree which are the original samples. A node i greater than or equal to n_samples is a non-leaf node and has children children_[i - n_samples]. Alternatively, at the i-th iteration, children_[i][0] and children_[i][1] are merged to form node n_samples + i.
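
The node numbering can be followed with a short loop that records which original samples sit under each node. This is a sketch on toy data; the members dictionary is an illustrative structure, not an attribute of the estimator.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
model = AgglomerativeClustering(n_clusters=2).fit(X)

n_samples = model.n_leaves_

# Leaves 0 .. n_samples-1 contain only themselves.
members = {i: [i] for i in range(n_samples)}

# The i-th row of children_ creates node n_samples + i.
for i, (left, right) in enumerate(model.children_):
    members[n_samples + i] = members[int(left)] + members[int(right)]

root = n_samples + len(model.children_) - 1
print(sorted(members[root]))
```

The last node created is the root of the tree, so it should contain every sample index.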

distances_ : array-like of shape (n_nodes-1,)

Distances between nodes in the corresponding place in children_. Only computed if distance_threshold is used or compute_distances is set to True.
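
As an illustration of how distances_ lines up with children_, the sketch below prints each merge next to its distance, and then checks that the attribute is absent when neither distance_threshold nor compute_distances is used. The toy data is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
n_samples = len(X)

model = AgglomerativeClustering(n_clusters=2, compute_distances=True).fit(X)

# distances_[i] is the linkage distance of the merge stored in children_[i].
for i, ((left, right), dist) in enumerate(
    zip(model.children_, model.distances_)
):
    print(f"node {n_samples + i}: merge {left} and {right} at {dist:.3f}")

# Without compute_distances or distance_threshold the attribute is not set.
plain = AgglomerativeClustering(n_clusters=2).fit(X)
print(hasattr(plain, "distances_"))
```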

See also

FeatureAgglomeration

Agglomerative clustering but for features instead of samples.

ward_tree

Hierarchical clustering with ward linkage.

Examples

>>> from sklearn.cluster import AgglomerativeClustering
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clustering = AgglomerativeClustering().fit(X)
>>> clustering
AgglomerativeClustering()
>>> clustering.labels_
array([1, 1, 1, 0, 0, 0])

For a comparison of Agglomerative clustering with other clustering algorithms, see Comparing different clustering algorithms on toy datasets.

fit(X, y=None)

Fit the hierarchical clustering from features, or distance matrix.

Parameters:
X : array-like of shape (n_samples, n_features) or (n_samples, n_samples)

Training instances to cluster, or distances between instances if metric='precomputed'.

y : Ignored

Not used, present here for API consistency by convention.

Returns:
self : object

Returns the fitted instance.

fit_predict(X, y=None)

Fit and return the result of each sample’s clustering assignment.

In addition to fitting, this method also returns the result of the clustering assignment for each sample in the training set.

Parameters:
X : array-like of shape (n_samples, n_features) or (n_samples, n_samples)

Training instances to cluster, or distances between instances if metric='precomputed'.

y : Ignored

Not used, present here for API consistency by convention.

Returns:
labels : ndarray of shape (n_samples,)

Cluster labels.
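
A brief sketch of the relationship between fit_predict and fit, on illustrative toy data: the labels returned by fit_predict should match the labels_ attribute of an estimator fitted on the same data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# One call: fit the estimator and get the training-set labels back.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Two steps: fit, then read the labels_ attribute.
fitted = AgglomerativeClustering(n_clusters=2).fit(X)

print(np.array_equal(labels, fitted.labels_))
```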

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.
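
For illustration, the returned dictionary can be inspected or used to build a second estimator with the same configuration. The parameter values chosen here are arbitrary.

```python
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=3, linkage="average")

# Keys are constructor argument names; values are the current settings.
params = model.get_params()
print(params["n_clusters"], params["linkage"])

# The dictionary can be passed back to the constructor to copy the setup.
copy = AgglomerativeClustering(**params)
```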

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
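
The two cases can be sketched as follows. The step names "scale" and "cluster" and the use of StandardScaler are assumptions for illustration; any pipeline step name works with the <component>__<parameter> form.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simple estimator: parameters are set by name.
model = AgglomerativeClustering()
model.set_params(n_clusters=3, linkage="complete")

# Nested object: prefix the parameter with the step name and "__".
pipe = Pipeline(
    [("scale", StandardScaler()), ("cluster", AgglomerativeClustering())]
)
pipe.set_params(cluster__n_clusters=4, cluster__linkage="average")

print(model.n_clusters, model.linkage)
print(pipe.named_steps["cluster"].n_clusters)
```

Because set_params returns the estimator itself, calls can be chained, for example model.set_params(n_clusters=3).fit(X).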

Gallery examples

Agglomerative clustering with and without structure

Agglomerative clustering with different metrics

Plot Hierarchical Clustering Dendrogram

Comparing different clustering algorithms on toy datasets

A demo of structured Ward hierarchical clustering on an image of coins

Various Agglomerative Clustering on a 2D embedding of digits

Inductive Clustering

Comparing different hierarchical linkage methods on toy datasets

Hierarchical clustering: structured vs unstructured ward
