Cosine similarity

From Wikipedia, the free encyclopedia
Similarity measure for number sequences

In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval [−1, +1]. For example, two proportional vectors have a cosine similarity of +1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of −1. In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in [0, 1].

For example, in information retrieval and text mining, each word is assigned a different coordinate and a document is represented by the vector of the numbers of occurrences of each word in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be, in terms of their subject matter, and independently of the length of the documents.[1]

The technique is also used to measure cohesion within clusters in the field of data mining.[2]

One advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero coordinates need to be considered.

Other names for cosine similarity include Orchini similarity and Tucker coefficient of congruence; the Otsuka–Ochiai similarity (see below) is cosine similarity applied to binary data.[3]

Definition


The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:

\mathbf{A} \cdot \mathbf{B} = \left\| \mathbf{A} \right\| \left\| \mathbf{B} \right\| \cos \theta

Given two n-dimensional vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

\text{cosine similarity} = S_C(A, B) := \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}},

where A_i and B_i are the i-th components of vectors \mathbf{A} and \mathbf{B}, respectively.
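As a concrete illustration of the formula above, here is a minimal sketch in Python with NumPy (the function name cosine_similarity and the example vectors are chosen for illustration, not taken from this article):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # proportional to a, so the similarity is +1
c = np.array([-1.0, -2.0, -3.0])  # opposite to a, so the similarity is -1

print(cosine_similarity(a, b))  # 1.0 (up to floating-point rounding)
print(cosine_similarity(a, c))  # -1.0
```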

The resulting similarity ranges from −1 meaning exactly opposite, to +1 meaning exactly the same, with 0 indicating orthogonality or decorrelation, while in-between values indicate intermediate similarity or dissimilarity.

For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. This remains true when using TF-IDF weights. The angle between two term frequency vectors cannot be greater than 90°.
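A small, hedged example of this use (the two documents and the vocabulary are invented for illustration): build raw term-frequency vectors over a shared vocabulary and compare them with the cosine.

```python
from collections import Counter
import numpy as np

doc1 = "the cat sat on the mat".split()
doc2 = "the cat sat on the cat".split()

vocab = sorted(set(doc1) | set(doc2))
c1, c2 = Counter(doc1), Counter(doc2)
tf1 = np.array([c1[w] for w in vocab], dtype=float)  # term-frequency vector of doc1
tf2 = np.array([c2[w] for w in vocab], dtype=float)  # term-frequency vector of doc2

sim = np.dot(tf1, tf2) / (np.linalg.norm(tf1) * np.linalg.norm(tf2))
print(sim)  # lies in [0, 1] because term frequencies cannot be negative
```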

If the attribute vectors are normalized by subtracting the vector means (e.g., A − Ā), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient. For an example of centering,

\text{if } A = [A_1, A_2]^T \text{, then } \bar{A} = \left[ \frac{A_1 + A_2}{2}, \frac{A_1 + A_2}{2} \right]^T \text{, so } A - \bar{A} = \left[ \frac{A_1 - A_2}{2}, \frac{-A_1 + A_2}{2} \right]^T.
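A quick numerical check (data invented) that centering each vector by its mean and then taking the cosine reproduces the Pearson correlation coefficient:

```python
import numpy as np

A = np.array([1.0, 3.0, 5.0, 7.0])
B = np.array([2.0, 2.5, 6.0, 9.0])

Ac, Bc = A - A.mean(), B - B.mean()   # centering step
centered_cosine = np.dot(Ac, Bc) / (np.linalg.norm(Ac) * np.linalg.norm(Bc))
pearson = np.corrcoef(A, B)[0, 1]

print(np.isclose(centered_cosine, pearson))  # True
```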

Cosine distance


When the distance between two unit-length vectors is defined to be the length of their vector difference, then

\operatorname{dist}(\mathbf{A}, \mathbf{B}) = \sqrt{(\mathbf{A} - \mathbf{B}) \cdot (\mathbf{A} - \mathbf{B})} = \sqrt{\mathbf{A} \cdot \mathbf{A} - 2(\mathbf{A} \cdot \mathbf{B}) + \mathbf{B} \cdot \mathbf{B}} = \sqrt{2(1 - S_C(\mathbf{A}, \mathbf{B}))} \,.

Nonetheless, the cosine distance[4] is often defined without the square root or factor of 2:

\text{cosine distance} = D_C(A, B) := 1 - S_C(A, B) \,.

It is important to note that, by virtue of being proportional to squared Euclidean distance, the cosine distance is not a true distance metric; it does not exhibit the triangle inequality property (or, more formally, the Schwarz inequality) and it violates the coincidence axiom. To repair the triangle inequality property while maintaining the same ordering, one can convert to the Euclidean distance \sqrt{2(1 - S_C(A, B))} or the angular distance θ = arccos(S_C(A, B)). Alternatively, the triangle inequality that does work for angular distances can be expressed directly in terms of the cosines; see below.
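A small numerical illustration of the failed triangle inequality (the three vectors, at 0°, 45° and 90°, are chosen for this sketch rather than taken from the article):

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([1.0, 0.0])                  # 0 degrees
C = np.array([1.0, 1.0]) / np.sqrt(2.0)   # 45 degrees
B = np.array([0.0, 1.0])                  # 90 degrees

# D_C(A, B) exceeds D_C(A, C) + D_C(C, B), so the triangle inequality fails.
print(cosine_distance(A, B))                          # 1.0
print(cosine_distance(A, C) + cosine_distance(C, B))  # about 0.586
```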

Angular distance and similarity


The normalized angle, referred to as angular distance, between any two vectors A and B is a formal distance metric and can be calculated from the cosine similarity.[5] The complement of the angular distance metric can then be used to define an angular similarity function bounded between 0 and 1, inclusive.

When the vector elements may be positive or negative:

\text{angular distance} = D_\theta := \frac{\arccos(\text{cosine similarity})}{\pi} = \frac{\theta}{\pi}
\text{angular similarity} = S_\theta := 1 - \text{angular distance} = 1 - \frac{\theta}{\pi}

Or, if the vector elements are always positive:

\text{angular distance} = D_\theta := \frac{2 \cdot \arccos(\text{cosine similarity})}{\pi} = \frac{2\theta}{\pi}
\text{angular similarity} = S_\theta := 1 - \text{angular distance} = 1 - \frac{2\theta}{\pi}

Unfortunately, computing the inverse cosine (arccos) function is slow, making the use of the angular distance more computationally expensive than using the more common (but not metric) cosine distance above.
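The two conventions above can be written as a short sketch (the function names and the clipping step are illustrative; np.clip guards against tiny floating-point excursions of the cosine outside [−1, 1]):

```python
import numpy as np

def angular_distance(a: np.ndarray, b: np.ndarray, nonnegative: bool = False) -> float:
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    theta = np.arccos(np.clip(cos_sim, -1.0, 1.0))
    scale = 2.0 if nonnegative else 1.0   # use 2*theta/pi when elements are always positive
    return float(scale * theta / np.pi)

def angular_similarity(a: np.ndarray, b: np.ndarray, nonnegative: bool = False) -> float:
    return 1.0 - angular_distance(a, b, nonnegative)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(angular_distance(x, y))    # 0.5 (90 degrees out of a possible 180)
print(angular_similarity(x, y))  # 0.5
```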

L2-normalized Euclidean distance


Another effective proxy for cosine distance can be obtained by L2 normalisation of the vectors, followed by the application of normal Euclidean distance. Using this technique, each term in each vector is first divided by the magnitude of the vector, yielding a vector of unit length. Then the Euclidean distance over the end-points of any two vectors is a proper metric which gives the same ordering as the cosine distance (a monotonic transformation of Euclidean distance; see below) for any comparison of vectors, and furthermore avoids the potentially expensive trigonometric operations required to yield a proper metric. Once the normalisation has occurred, the vector space can be used with the full range of techniques available to any Euclidean space, notably standard dimensionality reduction techniques. This normalised form of the distance is often used within many deep learning algorithms.
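A brief check (random data, seed chosen arbitrarily) that the Euclidean distance between L2-normalised vectors equals \sqrt{2(1 - S_C)}, and is therefore a monotonic transform of the cosine distance that preserves orderings:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)   # L2 normalisation
euclidean = np.linalg.norm(a_hat - b_hat)

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(euclidean, np.sqrt(2.0 * (1.0 - cos_sim))))  # True
```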

Otsuka–Ochiai coefficient


In biology, there is a similar concept known as the Otsuka–Ochiai coefficient, named after Yanosuke Otsuka (also spelled as Ōtsuka, Ootsuka or Otuka,[6] Japanese: 大塚 弥之助)[7] and Akira Ochiai (Japanese: 落合 明),[8] also known as the Ochiai–Barkman[9] or Ochiai coefficient,[10] which can be represented as:

K = \frac{|A \cap B|}{\sqrt{|A| \times |B|}}

Here, A and B are sets, and |A| is the number of elements in A. If sets are represented as bit vectors, the Otsuka–Ochiai coefficient can be seen to be the same as the cosine similarity. It is identical to the score introduced by Godfrey Thomson.[11]
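An illustrative comparison (the sets are invented) showing that the set form of the coefficient matches the cosine similarity computed on the corresponding bit vectors:

```python
import numpy as np

A = {"a", "b", "c", "d"}
B = {"b", "c", "e"}

ochiai = len(A & B) / np.sqrt(len(A) * len(B))

universe = sorted(A | B)
a_bits = np.array([1.0 if x in A else 0.0 for x in universe])
b_bits = np.array([1.0 if x in B else 0.0 for x in universe])
cosine = np.dot(a_bits, b_bits) / (np.linalg.norm(a_bits) * np.linalg.norm(b_bits))

print(np.isclose(ochiai, cosine))  # True
```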

In a recent book,[12] the coefficient is tentatively misattributed to another Japanese researcher with the family name Otsuka. The confusion arises because in 1957 Akira Ochiai attributes the coefficient only to Otsuka (no first name mentioned)[8] by citing an article by Ikuso Hamai (Japanese: 浜井 生三),[13] who in turn cites the original 1936 article by Yanosuke Otsuka.[7]

Properties


The most noteworthy property of cosine similarity is that it reflects a relative, rather than absolute, comparison of the individual vector dimensions. For any positive constant a and vector V, the vectors V and aV are maximally similar. The measure is thus most appropriate for data where frequency is more important than absolute values; notably, term frequency in documents. However, more recent metrics with a grounding in information theory, such as Jensen–Shannon, SED, and triangular divergence, have been shown to have improved semantics in at least some contexts.[14]
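A quick check of the scale-invariance property (the vector and constants are chosen arbitrarily): V and aV have cosine similarity 1 for any positive a.

```python
import numpy as np

V = np.array([0.5, -2.0, 3.0])
for a in (0.1, 1.0, 42.0):
    W = a * V
    sim = np.dot(V, W) / (np.linalg.norm(V) * np.linalg.norm(W))
    print(round(float(sim), 12))  # 1.0 each time
```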

Cosine similarity is related to Euclidean distance as follows. Denote Euclidean distance by the usual \|A - B\|, and observe that

\|A - B\|^2 = (A - B) \cdot (A - B) = \|A\|^2 + \|B\|^2 - 2(A \cdot B) \quad (polarization identity)

by expansion. When A and B are normalized to unit length, \|A\|^2 = \|B\|^2 = 1, so this expression is equal to

2(1 - \cos(A, B)).

In short, the cosine distance can be expressed in terms of Euclidean distance as

D_C(A, B) = \frac{\|A - B\|^2}{2} \quad \text{when} \quad \|A\|^2 = \|B\|^2 = 1.

The Euclidean distance is called the chord distance (because it is the length of the chord on the unit circle) and it is the Euclidean distance between the vectors which were normalized to unit sum of squared values within them.

Null distribution: For data which can be negative as well as positive, the null distribution for cosine similarity is the distribution of the dot product of two independent random unit vectors. This distribution has a mean of zero and a variance of 1/n (where n is the number of dimensions), and although the distribution is bounded between −1 and +1, as n grows large the distribution is increasingly well-approximated by the normal distribution.[15][16] For other types of data, such as bitstreams, which only take the values 0 or 1, the null distribution takes a different form and may have a nonzero mean.[17]
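A Monte Carlo sketch of this null distribution (the sample sizes are chosen only for illustration): the cosine similarity of independent random unit vectors has mean close to 0 and variance close to 1/n.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 20_000

x = rng.normal(size=(trials, n))
y = rng.normal(size=(trials, n))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # random unit vectors
y /= np.linalg.norm(y, axis=1, keepdims=True)

sims = np.sum(x * y, axis=1)
print(sims.mean())        # near 0
print(sims.var(), 1 / n)  # both near 0.01
```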

Triangle inequality for cosine similarity


The ordinary triangle inequality for angles (i.e., arc lengths on a unit hypersphere) gives us that

| \angle AC - \angle CB | \leq \angle AB \leq \angle AC + \angle CB \,.

Because the cosine function decreases as an angle in [0, π] radians increases, the sense of these inequalities is reversed when we take the cosine of each value:

\cos(\angle AC - \angle CB) \geq \cos(\angle AB) \geq \cos(\angle AC + \angle CB).

Using the cosine addition and subtraction formulas, these two inequalities can be written in terms of the original cosines,

\cos(A, C) \cdot \cos(C, B) + \sqrt{\left(1 - \cos(A, C)^2\right) \cdot \left(1 - \cos(C, B)^2\right)} \geq \cos(A, B),
\cos(A, B) \geq \cos(A, C) \cdot \cos(C, B) - \sqrt{\left(1 - \cos(A, C)^2\right) \cdot \left(1 - \cos(C, B)^2\right)}.

This form of the triangle inequality can be used to bound the minimum and maximum similarity of two objects A and B if the similarities to a reference object C are already known. This is used, for example, in metric data indexing, but has also been used to accelerate spherical k-means clustering[18] the same way the Euclidean triangle inequality has been used to accelerate regular k-means.
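A sketch (not from any particular library) of these bounds: given the similarities of A and B to a reference object C, bound cos(A, B) from both sides and verify on random vectors.

```python
import numpy as np

def cosine_bounds(cos_ac: float, cos_cb: float) -> tuple:
    """Lower and upper bounds on cos(A, B) implied by cos(A, C) and cos(C, B)."""
    cross = np.sqrt((1.0 - cos_ac ** 2) * (1.0 - cos_cb ** 2))
    return cos_ac * cos_cb - cross, cos_ac * cos_cb + cross

def cos_sim(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(2)
A, B, C = rng.normal(size=(3, 4))
lo, hi = cosine_bounds(cos_sim(A, C), cos_sim(C, B))
print(lo <= cos_sim(A, B) <= hi)  # True
```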

Soft cosine measure


A soft cosine (or "soft" similarity) between two vectors considers similarities between pairs of features.[19] The traditional cosine similarity considers the vector space model (VSM) features as independent or completely different, while the soft cosine measure proposes considering the similarity of features in VSM, which helps generalize the concept of cosine (and soft cosine) as well as the idea of (soft) similarity.

For example, in the field of natural language processing (NLP) the similarity among features is quite intuitive. Features such as words, n-grams, or syntactic n-grams[20] can be quite similar, though formally they are considered as different features in the VSM. For example, the words "play" and "game" are different words and thus mapped to different points in VSM; yet they are semantically related. In the case of n-grams or syntactic n-grams, Levenshtein distance can be applied (in fact, Levenshtein distance can be applied to words as well).

For calculating the soft cosine, the matrix s is used to indicate similarity between features. It can be calculated through Levenshtein distance, WordNet similarity, or other similarity measures. Then we just multiply by this matrix.

Given two N-dimensional vectors a and b, the soft cosine similarity is calculated as follows:

\operatorname{soft\_cosine}_1(a, b) = \frac{\sum_{i,j}^{N} s_{ij} a_i b_j}{\sqrt{\sum_{i,j}^{N} s_{ij} a_i a_j} \, \sqrt{\sum_{i,j}^{N} s_{ij} b_i b_j}},

where s_{ij} = similarity(feature_i, feature_j).

If there is no similarity between features (s_{ii} = 1, s_{ij} = 0 for i ≠ j), the given equation is equivalent to the conventional cosine similarity formula.
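A direct O(N²) sketch of the formula above (the feature-similarity matrix s here is made up; in practice it might come from Levenshtein distance, WordNet, or word embeddings):

```python
import numpy as np

def soft_cosine(a: np.ndarray, b: np.ndarray, s: np.ndarray) -> float:
    """Soft cosine similarity with feature-similarity matrix s."""
    num = a @ s @ b
    den = np.sqrt(a @ s @ a) * np.sqrt(b @ s @ b)
    return float(num / den)

s = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])   # s[i, j] = similarity(feature_i, feature_j)
a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 1.0])

print(soft_cosine(a, b, s))
print(soft_cosine(a, b, np.eye(3)))  # with s = I this reduces to the ordinary cosine
```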

The time complexity of this measure is quadratic, which makes it applicable to real-world tasks. Note that the complexity can be reduced to subquadratic.[21] An efficient implementation of such soft cosine similarity is included in the Gensim open source library.


References

  1. ^ Singhal, Amit (2001). "Modern Information Retrieval: A Brief Overview". Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 24 (4): 35–43.
  2. ^ P.-N. Tan, M. Steinbach & V. Kumar, Introduction to Data Mining, Addison-Wesley (2005), ISBN 0-321-32136-7, chapter 8; page 500.
  3. ^ "cosine_similarity: Function for calculating the cosine similarity". rdrr.io. Retrieved 18 November 2024.
  4. ^ Wolfram Research (2007). "CosineDistance – Wolfram Language & System Documentation Center". wolfram.com.
  5. ^ "COSINE DISTANCE, COSINE SIMILARITY, ANGULAR COSINE DISTANCE, ANGULAR COSINE SIMILARITY". www.itl.nist.gov. Retrieved 2020-07-11.
  6. ^ Omori, Masae (2004). "Geological idea of Yanosuke Otuka, who built the foundation of neotectonics (geoscientist)". Earth Science. 58 (4): 256–259. doi:10.15080/agcjchikyukagaku.58.4_256.
  7. ^ a b Otsuka, Yanosuke (1936). "The faunal character of the Japanese Pleistocene marine Mollusca, as evidence of the climate having become colder during the Pleistocene in Japan". Bulletin of the Biogeographical Society of Japan. 6 (16): 165–170.
  8. ^ a b Ochiai, Akira (1957). "Zoogeographical studies on the soleoid fishes found in Japan and its neighbouring regions-II". Bulletin of the Japanese Society of Scientific Fisheries. 22 (9): 526–530. doi:10.2331/suisan.22.526.
  9. ^ Barkman, Jan J. (1958). Phytosociology and Ecology of Cryptogamic Epiphytes: Including a Taxonomic Survey and Description of Their Vegetation Units in Europe. Assen: Van Gorcum.
  10. ^ Romesburg, H. Charles (1984). Cluster Analysis for Researchers. Belmont, California: Lifetime Learning Publications. p. 149.
  11. ^ Thomson, Godfrey (1916). "A hierarchy without a general factor" (PDF). British Journal of Psychology. 8: 271–281.
  12. ^ Howarth, Richard J. (2017). Dictionary of Mathematical Geosciences: With Historical Notes. Cham: Springer. p. 421. Bibcode:2017dmgh.book.....H. doi:10.1007/978-3-319-57315-1. ISBN 978-3-319-57314-4. S2CID 67081034. "[…] attributed by him to 'Otsuka' [?A. Otsuka of the Dept. of Fisheries, Tohoku University]."
  13. ^ Hamai, Ikuso (1955). "Stratification of community by means of 'community coefficient' (continued)". Japanese Journal of Ecology. 5 (1): 41–45. doi:10.18960/seitai.5.1_41.
  14. ^ Connor, Richard (2016). A Tale of Four Metrics. Similarity Search and Applications. Tokyo: Springer. doi:10.1007/978-3-319-46759-7_16.
  15. ^ Spruill, Marcus C. (2007). "Asymptotic distribution of coordinates on high dimensional spheres". Electronic Communications in Probability. 12: 234–247. doi:10.1214/ECP.v12-1294.
  16. ^ "Distribution of dot products between two random unit vectors in RD". CrossValidated.
  17. ^ Giller, Graham L. (2012). "The Statistical Properties of Random Bitstreams and the Sampling Distribution of Cosine Similarity". Giller Investments Research Notes (20121024/1). doi:10.2139/ssrn.2167044. S2CID 123332455.
  18. ^ Schubert, Erich; Lang, Andreas; Feher, Gloria (2021). "Accelerating Spherical k-Means". In Reyes, Nora; Connor, Richard; Kriege, Nils; Kazempour, Daniyal; Bartolini, Ilaria; Schubert, Erich; Chen, Jian-Jia (eds.). Similarity Search and Applications. Lecture Notes in Computer Science. Vol. 13058. Cham: Springer International Publishing. pp. 217–231. arXiv:2107.04074. doi:10.1007/978-3-030-89657-7_17. ISBN 978-3-030-89657-7. S2CID 235790358.
  19. ^ Sidorov, Grigori; Gelbukh, Alexander; Gómez-Adorno, Helena; Pinto, David (29 September 2014). "Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model". Computación y Sistemas. 18 (3): 491–504. doi:10.13053/CyS-18-3-2043. Retrieved 7 October 2014.
  20. ^ Sidorov, Grigori; Velasquez, Francisco; Stamatatos, Efstathios; Gelbukh, Alexander; Chanona-Hernández, Liliana (2013). Advances in Computational Intelligence. Lecture Notes in Computer Science. Vol. 7630. LNAI 7630. pp. 1–11. doi:10.1007/978-3-642-37798-3_1. ISBN 978-3-642-37798-3.
  21. ^ Novotný, Vít (2018). Implementation Notes for the Soft Cosine Measure. The 27th ACM International Conference on Information and Knowledge Management. Association for Computing Machinery. pp. 1639–1642. arXiv:1808.09407. doi:10.1145/3269206.3269317. ISBN 978-1-4503-6014-2.
