GloVe

From Wikipedia, the free encyclopedia
Algorithm for obtaining vector representations of words
Not to be confused with Glove.

GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.[1] Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families, namely global matrix factorization and local context window methods.

It was developed as an open-source project at Stanford[2] and was launched in 2014. It was designed as a competitor to word2vec, and the original paper noted multiple improvements of GloVe over word2vec. As of 2022, both approaches are outdated, and Transformer-based models, such as BERT, which add multiple neural-network attention layers on top of a word embedding model similar to word2vec, have come to be regarded as the state of the art in NLP.[3]

Definition


You shall know a word by the company it keeps (Firth, J. R. 1957:11)[4]

The idea of GloVe is to construct, for each word $i$, two vectors $w_i, \tilde{w}_i$, such that the relative positions of the vectors capture part of the statistical regularities of the word $i$. The statistical regularity is defined as the co-occurrence probabilities: words that resemble each other in meaning should also resemble each other in co-occurrence probabilities.

Word counting


Let the vocabulary be $V$, the set of all possible words (aka "tokens"). Punctuation is either ignored or treated as vocabulary, and similarly for capitalization and other typographical details.[1]

If two words occur close to each other, then we say that they occur in the context of each other. For example, if the context length is 3, then we say that in the following sentence

GloVe₁, coined₂ from₃ Global₄ Vectors₅, is₆ a₇ model₈ for₉ distributed₁₀ word₁₁ representation₁₂

the word "model8" is in the context of "word11" but not the context of "representation12".

A word is not in the context of itself, so "model₈" is not in the context of the word "model₈"; however, if a word appears a second time within the context window, that repeated occurrence does count.

Let $X_{ij}$ be the number of times that the word $j$ appears in the context of the word $i$ over the entire corpus. For example, if the corpus is just "I don't think that that is a problem." we have $X_{\text{that},\text{that}} = 2$, since the first "that" appears in the second one's context, and vice versa.

Let $X_i = \sum_{j \in V} X_{ij}$ be the number of words in the context of all instances of word $i$. By counting, we have $X_i = 2 \times (\text{context size}) \times \#(\text{occurrences of word } i)$ (except for words occurring right at the start and end of the corpus).
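
A minimal Python sketch of this counting step (the function name is illustrative; the window here is symmetric and unweighted, whereas the reference GloVe implementation additionally down-weights more distant context words):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=3):
    """Count X[i][j]: how often word j appears within `window`
    positions of word i over the whole token sequence."""
    X = defaultdict(lambda: defaultdict(float))
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos == pos:
                continue  # a word is not in its own context
            X[word][tokens[ctx_pos]] += 1.0
    return X

# The example from the text: the two "that" tokens fall within each
# other's context window, so X["that"]["that"] == 2.
tokens = "I don't think that that is a problem .".split()
X = cooccurrence_counts(tokens, window=3)
print(X["that"]["that"])                                # 2.0
X_i = {w: sum(ctx.values()) for w, ctx in X.items()}    # X_i = sum_j X_ij
```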

Probabilistic modelling


Let $P_{ik} := P(k \mid i) := \frac{X_{ik}}{X_i}$ be the co-occurrence probability. That is, if one samples a random occurrence of the word $i$ in the entire document, and a random word within its context, that word is $k$ with probability $P_{ik}$. Note that $P_{ik} \neq P_{ki}$ in general. For example, in a typical modern English corpus, $P_{\text{ado},\text{much}}$ is close to one, but $P_{\text{much},\text{ado}}$ is close to zero. This is because the word "ado" is almost only used in the context of the archaic phrase "much ado about", but the word "much" occurs in all kinds of contexts.
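
Continuing the illustrative sketch above, the co-occurrence probability can be read off directly from the counts (the helper name is illustrative):

```python
def cooccurrence_probability(X, i, k):
    """P(k | i) = X_ik / X_i: the probability that a random word drawn
    from the contexts of word i is the word k."""
    X_i = sum(X[i].values())
    return X[i].get(k, 0.0) / X_i

# On the toy corpus above, X_that,that = 2 and X_that = 12,
# so P(that | that) = 2/12.
print(cooccurrence_probability(X, "that", "that"))
```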

For example, in a 6 billion token corpus, we have

Table 1 of [1]

| Probability and ratio | $k = \text{solid}$ | $k = \text{gas}$ | $k = \text{water}$ | $k = \text{fashion}$ |
|---|---|---|---|---|
| $P(k \mid \text{ice})$ | $1.9 \times 10^{-4}$ | $6.6 \times 10^{-5}$ | $3.0 \times 10^{-3}$ | $1.7 \times 10^{-5}$ |
| $P(k \mid \text{steam})$ | $2.2 \times 10^{-5}$ | $7.8 \times 10^{-4}$ | $2.2 \times 10^{-3}$ | $1.8 \times 10^{-5}$ |
| $P(k \mid \text{ice}) / P(k \mid \text{steam})$ | $8.9$ | $8.5 \times 10^{-2}$ | $1.36$ | $0.96$ |

Inspecting the table, we see that the words "ice" and "steam" are indistinguishable along the "water" dimension (which often co-occurs with both) and the "fashion" dimension (which rarely co-occurs with either), but are distinguishable along "solid" (which co-occurs more with "ice") and "gas" (which co-occurs more with "steam").

The idea is to learn two vectors $w_i, \tilde{w}_i$ for each word $i$ such that we have a log-bilinear regression:

$$w_i^T \tilde{w}_j + b_i + \tilde{b}_j \approx \ln P_{ij}$$

where the terms $b_i, \tilde{b}_j$ are unimportant bias parameters.

This means that if the words $i, j$ have similar co-occurrence probabilities, $(P_{ik})_{k \in V} \approx (P_{jk})_{k \in V}$, then their vectors should also be similar: $w_i \approx w_j$.
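
Written out with illustrative names (the vectors and biases would come from training), the modelling assumption is just a dot product plus two scalar biases:

```python
import numpy as np

def predicted_log_prob(w_i, w_tilde_j, b_i, b_tilde_j):
    """GloVe's modelling assumption: this quantity should approximate
    ln P_ij for the word pair (i, j)."""
    return float(np.dot(w_i, w_tilde_j) + b_i + b_tilde_j)

# e.g. with random 50-dimensional vectors as stand-ins for trained ones
rng = np.random.default_rng(0)
w, wt = rng.normal(size=50), rng.normal(size=50)
print(predicted_log_prob(w, wt, 0.1, -0.2))
```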

Weighted least squares


Naively, this regression can be fit by minimizing the squared loss:

$$L = \sum_{i,j \in V} \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \ln P_{ij} \right)^2$$

However, this would be noisy for rare co-occurrences. To fix the issue, the squared loss is weighted so that the loss ramps up slowly as the absolute number of co-occurrences $X_{ij}$ increases:

$$L = \sum_{i,j \in V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \ln P_{ij} \right)^2$$

where

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

and $x_{\max}, \alpha$ are hyperparameters. In the original paper, the authors found that $x_{\max} = 100$ and $\alpha = 3/4$ work well in practice.
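
A sketch of the weighting function and the weighted loss in NumPy, using illustrative names and a dict-of-dicts of counts keyed by integer word ids; the reference implementation instead optimizes this objective with stochastic gradient descent (AdaGrad) over the nonzero co-occurrence entries rather than evaluating the full sum:

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """The ramp-up weighting f(x); x_max = 100 and alpha = 3/4 are the
    values reported to work well in the original paper."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(W, W_tilde, b, b_tilde, X, X_totals):
    """Weighted squared loss over observed co-occurrences.

    W, W_tilde: arrays of shape (|V|, d); b, b_tilde: arrays of shape (|V|,);
    X: dict of dicts of raw counts X_ij keyed by integer word ids;
    X_totals: X_i = sum_j X_ij per word id.
    Only nonzero counts contribute, since ln P_ij is undefined otherwise.
    """
    loss = 0.0
    for i, row in X.items():
        for j, x_ij in row.items():
            log_p = np.log(x_ij / X_totals[i])
            err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - log_p
            loss += weight(x_ij) * err ** 2
    return loss
```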

Use


Once the model is trained, there are four trained parameters for each word $i$: $w_i, \tilde{w}_i, b_i, \tilde{b}_i$. The biases $b_i, \tilde{b}_i$ are discarded; only the vectors $w_i, \tilde{w}_i$ are kept.

The authors recommended using $w_i + \tilde{w}_i$ as the final representation vector for word $i$, because empirically it worked better than $w_i$ or $\tilde{w}_i$ alone.
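
With the same illustrative matrices as in the sketches above, the recommended final representation is simply the element-wise sum of the two learned vector sets:

```python
def final_embeddings(W, W_tilde):
    """Return W + W_tilde (both of shape (|V|, d)), the per-word sum the
    GloVe authors recommend as the final word representation."""
    return W + W_tilde
```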

Applications


GloVe can be used to find relations between words, such as synonyms, company-product relations, and zip codes and cities. However, the unsupervised learning algorithm is not effective at identifying homographs, i.e., words with the same spelling and different meanings, because it computes a single set of vectors for words with the same morphological structure.[5] The algorithm is also used by the spaCy library to build semantic word embedding features and to compute the closest matching words under distance measures such as cosine similarity and Euclidean distance.[6] GloVe was also used as the word representation framework for online and offline systems designed to detect psychological distress in patient interviews.[7]
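
As an illustration of such distance-based lookups, the sketch below loads pretrained vectors from a whitespace-separated text file (the plain-text format used by the publicly released Stanford GloVe vectors; the file name is illustrative) and ranks neighbours by cosine similarity:

```python
import numpy as np

def load_glove(path):
    """Read one word plus its vector components per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def nearest(vectors, query, k=5):
    """Rank words by cosine similarity to `query`."""
    q = vectors[query]
    q = q / np.linalg.norm(q)
    scored = [(float(q @ v) / float(np.linalg.norm(v)), w)
              for w, v in vectors.items() if w != query]
    return sorted(scored, reverse=True)[:k]

# Usage (path is illustrative):
# vectors = load_glove("glove.6B.300d.txt")
# print(nearest(vectors, "ice"))
```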


References

  1. ^ a b c Pennington, Jeffrey; Socher, Richard; Manning, Christopher (October 2014). Moschitti, Alessandro; Pang, Bo; Daelemans, Walter (eds.). "GloVe: Global Vectors for Word Representation". Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics: 1532–1543. doi:10.3115/v1/D14-1162.
  2. ^ GloVe: Global Vectors for Word Representation (pdf). Archived 2020-09-03 at the Wayback Machine. "We use our insights to construct a new model for word representation which we call GloVe, for Global Vectors, because the global corpus statistics are captured directly by the model."
  3. ^ Von der Mosel, Julian; Trautsch, Alexander; Herbold, Steffen (2022). "On the validity of pre-trained transformers for natural language processing in the software engineering domain". IEEE Transactions on Software Engineering. 49 (4): 1487–1507. arXiv:2109.04738. doi:10.1109/TSE.2022.3178469. ISSN 1939-3520. S2CID 237485425.
  4. ^ Firth, J. R. (1957). Studies in Linguistic Analysis (PDF). Wiley-Blackwell.
  5. ^ Wenig, Phillip (2019). "Creation of Sentence Embeddings Based on Topical Word Representations: An approach towards universal language understanding". Towards Data Science.
  6. ^ Singh, Mayank; Gupta, P. K.; Tyagi, Vipin; Flusser, Jan; Ören, Tuncer I. (2018). Advances in Computing and Data Sciences: Second International Conference, ICACDS 2018, Dehradun, India, April 20–21, 2018, Revised Selected Papers. Singapore: Springer. p. 171. ISBN 9789811318122.
  7. ^ Abad, Alberto; Ortega, Alfonso; Teixeira, António; Mateo, Carmen; Hinarejos, Carlos; Perdigão, Fernando; Batista, Fernando; Mamede, Nuno (2016). Advances in Speech and Language Technologies for Iberian Languages: Third International Conference, IberSPEECH 2016, Lisbon, Portugal, November 23–25, 2016, Proceedings. Cham: Springer. p. 165. ISBN 9783319491691.
