pairwise_distances

sklearn.metrics.pairwise_distances(X, Y=None, metric='euclidean', *, n_jobs=None, force_all_finite='deprecated', ensure_all_finite=None, **kwds)

Compute the distance matrix from a feature array X and optional Y.

This function takes one or two feature arrays or a distance matrix, and returns a distance matrix.

  • If X is a feature array, of shape (n_samples_X, n_features), and:

    • Y is None and metric is not ‘precomputed’, the pairwise distances between X and itself are returned.

    • Y is a feature array of shape (n_samples_Y, n_features), the pairwise distances between X and Y are returned.

  • If X is a distance matrix, of shape (n_samples_X, n_samples_X), metric should be ‘precomputed’. Y is thus ignored and X is returned as is.
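The three input modes listed above can be sketched as follows. The arrays are made-up illustration data, chosen so that the Euclidean distances are whole numbers.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Illustrative data only: 2 samples in X, 3 samples in Y, 2 features each.
X = np.array([[0.0, 0.0], [3.0, 4.0]])
Y = np.array([[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]])

# Only X given: distances between the rows of X and themselves, shape (2, 2).
D_xx = pairwise_distances(X)

# X and Y given: distances between rows of X and rows of Y, shape (2, 3).
D_xy = pairwise_distances(X, Y)

# X is already a distance matrix: it is validated and handed back.
D_pre = pairwise_distances(D_xx, metric="precomputed")

print(D_xx.shape, D_xy.shape, D_pre.shape)
```

The precomputed mode is what lets an estimator accept either raw features or a ready-made distance matrix through the same call.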

If the input is a collection of non-numeric data (e.g. a list of strings or a boolean array), a custom metric must be passed.

This method provides a safe way to take a distance matrix as input, while preserving compatibility with many other algorithms that take a vector array.

Valid values for metric are:

  • From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘nan_euclidean’]. All metrics support sparse matrix inputs except ‘nan_euclidean’.

  • From scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]. These metrics do not support sparse matrix inputs.

Note

'kulsinski' is deprecated from SciPy 1.9 and will be removed in SciPy 1.11.

Note

'matching' has been removed in SciPy 1.9 (use 'hamming' instead).

Note that in the case of ‘cityblock’, ‘cosine’ and ‘euclidean’ (which are valid scipy.spatial.distance metrics), the scikit-learn implementation will be used, which is faster and has support for sparse matrices (except for ‘cityblock’). For a verbose description of the metrics from scikit-learn, see the sklearn.metrics.pairwise.distance_metrics function.

Read more in the User Guide.
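As a rough illustration of the dispatch described above, the sketch below lists the metric names scikit-learn implements itself and compares one of them against the SciPy equivalent. The random data and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import distance_metrics

rng = np.random.default_rng(0)
X = rng.random((5, 3))
Y = rng.random((4, 3))

# Names of the metrics that scikit-learn implements natively.
native = sorted(distance_metrics())
print(native)

# 'euclidean' is handled by scikit-learn; the values should agree with
# SciPy up to floating-point tolerance.
d_sklearn = pairwise_distances(X, Y, metric="euclidean")
d_scipy = cdist(X, Y, metric="euclidean")

# 'chebyshev' is not a native metric, so it comes from scipy.spatial.distance.
d_cheb = pairwise_distances(X, Y, metric="chebyshev")
```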

Parameters:
X : {array-like, sparse matrix} of shape (n_samples_X, n_samples_X) or (n_samples_X, n_features)

Array of pairwise distances between samples, or a feature array. The shape of the array should be (n_samples_X, n_samples_X) if metric == “precomputed” and (n_samples_X, n_features) otherwise.

Y : {array-like, sparse matrix} of shape (n_samples_Y, n_features), default=None

An optional second feature array. Only allowed if metric != “precomputed”.

metric : str or callable, default=’euclidean’

The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.

n_jobs : int, default=None

The number of jobs to use for the computation. This works by breaking down the pairwise matrix into n_jobs even slices and computing them using multithreading.

None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

The “euclidean” and “cosine” metrics rely heavily on BLAS which is already multithreaded. So, increasing n_jobs would likely cause oversubscription and quickly degrade performance.

force_all_finite : bool or ‘allow-nan’, default=True

Whether to raise an error on np.inf, np.nan, pd.NA in array. Ignored for a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. The possibilities are:

  • True: Force all values of array to be finite.

  • False: accepts np.inf, np.nan, pd.NA in array.

  • ‘allow-nan’: accepts only np.nan and pd.NA values in array. Values cannot be infinite.

Added in version 0.22: force_all_finite accepts the string 'allow-nan'.

Changed in version 0.23: Accepts pd.NA and converts it into np.nan.

Deprecated since version 1.6: force_all_finite was renamed to ensure_all_finite and will be removed in 1.8.

ensure_all_finite : bool or ‘allow-nan’, default=True

Whether to raise an error on np.inf, np.nan, pd.NA in array. Ignored for a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. The possibilities are:

  • True: Force all values of array to be finite.

  • False: accepts np.inf, np.nan, pd.NA in array.

  • ‘allow-nan’: accepts only np.nan and pd.NA values in array. Values cannot be infinite.

Added in version 1.6: force_all_finite was renamed to ensure_all_finite.

**kwds : optional keyword parameters

Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.
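Two of the parameters above, **kwds and n_jobs, can be illustrated with a short sketch. It assumes the SciPy 'minkowski' metric, whose order p is forwarded through **kwds. The data and variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.array([[0.0, 0.0], [1.0, 1.0]])

# Extra keyword arguments are forwarded to the distance function.
# Here p is the order of the Minkowski distance.
D_p1 = pairwise_distances(X, metric="minkowski", p=1)
D_p3 = pairwise_distances(X, metric="minkowski", p=3)
print(D_p1[0, 1], D_p3[0, 1])

# n_jobs splits the work into slices. The values should not depend on it.
rng = np.random.default_rng(0)
Z = rng.random((200, 8))
D_serial = pairwise_distances(Z, metric="manhattan")
D_parallel = pairwise_distances(Z, metric="manhattan", n_jobs=2)
```

Whether n_jobs > 1 actually helps depends on the metric and the data size, so it is worth timing on real inputs.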

Returns:
D : ndarray of shape (n_samples_X, n_samples_X) or (n_samples_X, n_samples_Y)

A distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between the ith array from X and the jth array from Y.
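To make the indexing of the returned matrix concrete, the sketch below checks one entry of D against a direct computation. The arrays and the choice of the 'manhattan' metric are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])                   # 2 samples
Y = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [2.0, 2.0, 2.0]])  # 3 samples

D = pairwise_distances(X, Y, metric="manhattan")  # shape (2, 3)

# D[i, j] is the distance between row i of X and row j of Y.
i, j = 1, 2
entry = np.abs(X[i] - Y[j]).sum()
print(D.shape, D[i, j], entry)
```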

See also

pairwise_distances_chunked

Performs the same calculation as this function, but returns a generator of chunks of the distance matrix, in order to limit memory usage.

sklearn.metrics.pairwise.paired_distances

Computes the distances between corresponding elements of two arrays.
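The relationship between this function and the two related functions above can be sketched as follows. The random data and the working_memory value (in MiB) are illustrative assumptions chosen to force more than one chunk.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, pairwise_distances_chunked
from sklearn.metrics.pairwise import paired_distances

rng = np.random.default_rng(0)
X = rng.random((50, 4))
Y = rng.random((50, 4))

# Full matrix in one call.
D_full = pairwise_distances(X, Y)

# Same values, produced as a generator of row blocks to bound memory use.
chunks = list(pairwise_distances_chunked(X, Y, working_memory=0.01))
D_chunked = np.vstack(chunks)

# Distances between corresponding rows only: X[k] versus Y[k].
d_paired = paired_distances(X, Y)
print(len(chunks), D_chunked.shape, d_paired.shape)
```

With equally sized X and Y, the paired distances correspond to the diagonal of the full matrix.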

Notes

If metric is a callable, no restrictions are placed on X and Y dimensions.
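A callable metric is applied to one pair of rows at a time. A minimal sketch follows, where weighted_manhattan and WEIGHTS are hypothetical names introduced only for illustration:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical per-feature weights, used only for this example.
WEIGHTS = np.array([1.0, 2.0])

def weighted_manhattan(u, v):
    """Return the weighted L1 distance between two 1-D rows."""
    return float(np.sum(WEIGHTS * np.abs(u - v)))

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])

# The callable is evaluated for each pair of rows of X.
D = pairwise_distances(X, metric=weighted_manhattan)
print(D)
```

A Python callable is evaluated pair by pair, so it is usually much slower than the built-in string metrics on large inputs.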

Examples

>>> from sklearn.metrics.pairwise import pairwise_distances
>>> X = [[0, 0, 0], [1, 1, 1]]
>>> Y = [[1, 0, 0], [1, 1, 0]]
>>> pairwise_distances(X, Y, metric='sqeuclidean')
array([[1., 2.],
       [2., 1.]])
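A further sketch, assuming input with missing values: the 'nan_euclidean' metric listed among the scikit-learn metrics skips missing coordinates and rescales the remaining ones, so NaN entries in X do not appear in the result. The array is illustrative.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Second sample is missing its second feature.
X = np.array([[0.0, 1.0], [1.0, np.nan]])

# Only the first coordinate is present in both rows. Its squared difference
# (1.0) is scaled by 2 / 1 (total coordinates / present coordinates).
D = pairwise_distances(X, metric="nan_euclidean")
print(D)
```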

Gallery examples

Agglomerative clustering with different metrics

Release Highlights for scikit-learn 1.5
