
Explicit semantic analysis

From Wikipedia, the free encyclopedia

In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectoral representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus, and a document (string of words) is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora, including the Open Directory Project (ODP), have been used.^[1]

ESA was designed by Evgeniy Gabrilovich and Shaul Markovitch as a means of improving text categorization^[2] and has been used by this pair of researchers to compute what they refer to as "semantic relatedness" by means of cosine similarity between the aforementioned vectors, collectively interpreted as a space of "concepts explicitly defined and described by humans", where Wikipedia articles (or ODP entries, or otherwise titles of documents in the knowledge base corpus) are equated with concepts. The name "explicit semantic analysis" contrasts with latent semantic analysis (LSA), because the use of a knowledge base makes it possible to assign human-readable labels to the concepts that make up the vector space.^[1]^[3]

Model


To perform the basic variant of ESA, one starts with a collection of texts, say, all Wikipedia articles; let the number of documents in the collection be N. These are all turned into "bags of words", i.e., term frequency histograms, stored in an inverted index. Using this inverted index, one can find for any word the set of Wikipedia articles containing this word; in the vocabulary of Egozi, Markovitch and Gabrilovich, "each word appearing in the Wikipedia corpus can be seen as triggering each of the concepts it points to in the inverted index."^[1]
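As a rough illustration of this step, the following Python sketch builds an inverted index over a three-document toy corpus. The corpus contents, the whitespace tokenization, and the name `build_inverted_index` are assumptions made for the example; they are not taken from the ESA papers or from any particular implementation.

```python
from collections import Counter, defaultdict

# Toy stand-in for the knowledge base (hypothetical data):
# concept title -> article text.
corpus = {
    "Cat": "the cat is a small domesticated feline",
    "Dog": "the dog is a domesticated canine",
    "Car": "a car is a wheeled motor vehicle",
}

def build_inverted_index(docs):
    """Map each word to {concept title: term count in that concept's text}."""
    index = defaultdict(dict)
    for title, text in docs.items():
        # Bag of words: a term frequency histogram for this document.
        for word, count in Counter(text.lower().split()).items():
            index[word][title] = count
    return index

index = build_inverted_index(corpus)

# The concepts "triggered" by a word are the documents it points to.
concepts_for_domesticated = sorted(index["domesticated"])
```

Here `concepts_for_domesticated` holds the titles of the two articles that contain the word, which is the set of concepts the word triggers in this toy setting.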

The output of the inverted index for a single word query is a list of indexed documents (Wikipedia articles), each given a score depending on how often the word in question occurred in them (weighted by the total number of words in the document). Mathematically, this list is an N-dimensional vector of word-document scores, where a document not containing the query word has score zero. To compute the relatedness of two words, one compares the vectors (say u and v) by computing the cosine similarity,

$$\mathsf{sim}(\mathbf{u},\mathbf{v}) = \frac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|} = \frac{\sum_{i=1}^{N} u_i v_i}{\sqrt{\sum_{i=1}^{N} u_i^2}\,\sqrt{\sum_{i=1}^{N} v_i^2}}$$

and this gives a numeric estimate of the semantic relatedness of the words. The scheme is extended from single words to multi-word texts by simply summing the vectors of all words in the text.^[3]
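The scoring and comparison described above can be sketched in Python as follows. This is a minimal example under stated assumptions: the four-document corpus is invented, and the weighting (term count divided by document length, multiplied by log(N/df)) is just one common tf–idf variant, not necessarily the one used by Gabrilovich and Markovitch. The function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy knowledge base: concept title -> article text.
corpus = {
    "Cat": "the cat is a small domesticated feline that purrs",
    "Dog": "the dog is a domesticated canine that barks",
    "Car": "a car is a wheeled motor vehicle with an engine",
    "Truck": "a truck is a motor vehicle that carries cargo",
}
titles = sorted(corpus)  # fixes the order of the N vector dimensions
counts = {t: Counter(corpus[t].split()) for t in titles}

def word_vector(word):
    """N-dimensional vector with one tf-idf score per document (concept)."""
    df = sum(1 for t in titles if counts[t][word] > 0)
    if df == 0:
        return [0.0] * len(titles)
    idf = math.log(len(titles) / df)  # assumed idf variant
    # Term count weighted by document length, then by idf.
    return [counts[t][word] / sum(counts[t].values()) * idf for t in titles]

def text_vector(text):
    """Extend to multi-word text by summing the vectors of its words."""
    vec = [0.0] * len(titles)
    for word in text.lower().split():
        vec = [a + b for a, b in zip(vec, word_vector(word))]
    return vec

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

related = cosine(word_vector("motor"), word_vector("vehicle"))
unrelated = cosine(word_vector("motor"), word_vector("feline"))
```

In this toy corpus "motor" and "vehicle" occur in exactly the same documents, so `related` is close to 1, while "motor" and "feline" share no document and `unrelated` is 0. Because cosine similarity ignores vector length, summing the word vectors and taking their centroid give the same relatedness score.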

Analysis


ESA, as originally posited by Gabrilovich and Markovitch, operates under the assumption that the knowledge base contains topically orthogonal concepts. However, it was later shown by Anderka and Stein that ESA also improves the performance of information retrieval systems when it is based not on Wikipedia, but on the Reuters corpus of newswire articles, which does not satisfy the orthogonality property; in their experiments, Anderka and Stein used newswire stories as "concepts".^[4] To explain this observation, links have been shown between ESA and the generalized vector space model.^[5] Gabrilovich and Markovitch replied to Anderka and Stein by pointing out that their experimental result was achieved using "a single application of ESA (text similarity)" and "just a single, extremely small and homogenous test collection of 50 news documents".^[1]

Applications


Word relatedness


ESA is considered by its authors a measure of semantic relatedness (as opposed to semantic similarity). On datasets used to benchmark relatedness of words, ESA outperforms other algorithms, including WordNet semantic similarity measures and the skip-gram Neural Network Language Model (Word2vec).^[6]

Document relatedness


ESA is used in commercial software packages for computing relatedness of documents.^[7] Domain-specific restrictions on the ESA model are sometimes used to provide more robust document matching.^[8]

Extensions


Cross-language explicit semantic analysis (CL-ESA) is a multilingual generalization of ESA.^[9] CL-ESA exploits a document-aligned multilingual reference collection (e.g., again, Wikipedia) to represent a document as a language-independent concept vector. The relatedness of two documents in different languages is assessed by the cosine similarity between the corresponding vector representations.
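The idea can be illustrated with a small Python sketch. It assumes a hypothetical three-concept collection in which each concept has an English and a German description, and it scores concepts by raw word overlap instead of tf–idf to keep the example short; none of the names or data come from the CL-ESA paper.

```python
import math
from collections import Counter

# Hypothetical document-aligned collection: each concept is described
# once per language, and the concept titles align the descriptions.
aligned = {
    "Cat": {"en": "cat feline pet purrs", "de": "katze haustier schnurrt"},
    "Car": {"en": "car motor vehicle engine", "de": "auto motor fahrzeug"},
    "Music": {"en": "music song melody", "de": "musik lied melodie"},
}
concepts = sorted(aligned)  # shared, language-independent dimensions

def concept_vector(text, lang):
    """Score a text against each concept using that language's descriptions.

    Simplification: raw word-overlap counts stand in for tf-idf weights.
    """
    words = text.lower().split()
    vec = []
    for concept in concepts:
        counts = Counter(aligned[concept][lang].split())
        vec.append(float(sum(counts[w] for w in words)))
    return vec

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

en_doc = concept_vector("the cat purrs", "en")
de_same_topic = concept_vector("die katze schnurrt", "de")
de_other_topic = concept_vector("auto fahrzeug", "de")

same = cosine(en_doc, de_same_topic)
different = cosine(en_doc, de_other_topic)
```

Both the English and the German cat sentences activate only the "Cat" dimension, so `same` is 1, whereas the German sentence about cars activates only "Car" and `different` is 0. The two documents never need to share a word, only a concept space.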

References


[1] Egozi, Ofer; Markovitch, Shaul; Gabrilovich, Evgeniy (2011). "Concept-Based Information Retrieval using Explicit Semantic Analysis" (PDF). ACM Transactions on Information Systems. 29 (2): 1–34. doi:10.1145/1961209.1961211. S2CID 743663. Retrieved January 3, 2015.
[2] Gabrilovich, Evgeniy; Markovitch, Shaul (2006). Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge (PDF). Proc. 21st National Conference on Artificial Intelligence (AAAI). pp. 1301–1306.
[3] Gabrilovich, Evgeniy; Markovitch, Shaul (2007). Computing semantic relatedness using Wikipedia-based Explicit Semantic Analysis (PDF). Proc. 20th Int'l Joint Conf. on Artificial Intelligence (IJCAI). pp. 1606–1611.
[4] Maik Anderka and Benno Stein. The ESA retrieval model revisited. Archived 2012-06-10 at the Wayback Machine. Proceedings of the 32nd International ACM Conference on Research and Development in Information Retrieval (SIGIR), pp. 670–671, 2009.
[5] Thomas Gottron, Maik Anderka and Benno Stein. Insights into explicit semantic analysis. Archived 2012-06-10 at the Wayback Machine. Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), pp. 1961–1964, 2011.
[6] Kliegr, Tomáš, and Ondřej Zamazal. Antonyms are similar: Towards paradigmatic association approach to rating similarity in SimLex-999 and WordSim-353. Data & Knowledge Engineering 115 (2018): 174–193.
[7] Marc Hornick (November 17, 2017). "Explicit Semantic Analysis (ESA) for Text Analytics". blogs.oracle.com. Retrieved 31 March 2023.
[8] Luca Mazzola, Patrick Siegfried, Andreas Waldis, Michael Kaufmann, Alexander Denzler. A Domain Specific ESA Inspired Approach for Document Semantic Description. Proceedings of the 9th IEEE Conf. on Intelligent Systems 2018 (IS), pp. 383–390, 2018.
[9] Martin Potthast, Benno Stein, and Maik Anderka. A Wikipedia-based multilingual retrieval model. Archived 2012-06-10 at the Wayback Machine. Proceedings of the 30th European Conference on IR Research (ECIR), pp. 522–530, 2008.

External links


Explicit semantic analysis on Evgeniy Gabrilovich's homepage; has links to implementations
