
SMART Information Retrieval System

From Wikipedia, the free encyclopedia

The SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System is an information retrieval system developed at Cornell University in the 1960s.[1] Many important concepts in information retrieval were developed as part of research on the SMART system, including the vector space model, relevance feedback, and Rocchio classification.

Gerard Salton led the group that developed SMART. Other contributors included Mike Lesk.

The SMART system also provides a set of corpora, queries, and reference rankings drawn from different subject areas.

The legacy of the SMART system includes the so-called SMART triple notation, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form ddd.qqq, where the first three letters represent the term weighting of the collection document vector and the second three letters represent the term weighting of the query document vector. Within each triple, the letters select, in order, the term frequency, document frequency, and document length normalization components. For example, ltc.lnn denotes the ltc weighting applied to a collection document and the lnn weighting applied to a query document.
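As a small illustration of how a ddd.qqq string decomposes, the following Python sketch splits it into a document triple and a query triple. The names `Triple` and `parse_smart` are hypothetical helpers introduced here for illustration; they are not part of the SMART system.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One three-letter SMART weighting code, e.g. "ltc"."""
    tf: str    # term frequency component
    df: str    # document frequency component
    norm: str  # document length normalization component


def parse_smart(notation: str) -> tuple[Triple, Triple]:
    """Split a "ddd.qqq" string into (document triple, query triple)."""
    doc_code, sep, query_code = notation.partition(".")
    if sep != "." or len(doc_code) != 3 or len(query_code) != 3:
        raise ValueError(f"expected the form ddd.qqq, got {notation!r}")
    # Unpacking a three-character string yields its three letters in order.
    return Triple(*doc_code), Triple(*query_code)


doc, query = parse_smart("ltc.lnn")
print(doc)    # Triple(tf='l', df='t', norm='c')
print(query)  # Triple(tf='l', df='n', norm='n')
```

The parser only checks the shape of the string; whether each letter names a known weighting component is left to whatever code consumes the triples.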

The following tables establish the SMART notation:[2]

Symbols and notation

$D_i = \{w_{i_1}, w_{i_2}, \ldots, w_{i_t}\}$ represents a document vector, where $w_{i_k}$ is the weight of the term $T_k$ in $D_i$ and $t$ is the number of unique terms in $D_i$. Positive features characterize terms that are present in a document, and a weight of zero is used for terms that are absent from a document.

| Symbol | Meaning |
| --- | --- |
| $f_{i_k}$ | Occurrence frequency of term $T_k$ in document $D_i$ |
| $N$ | Number of collection documents |
| $n_k$ | Number of documents with term $T_k$ present |
| $\max(f_{i_k})$ | Occurrence frequency of the most common term in document $D_i$ |
| $\operatorname{avg}(f_{i_k})$ | Average occurrence frequency of a term in document $D_i$ |
| $u_i$ | Number of unique terms in document $D_i$ |
| $\operatorname{avg}(u)$ | Average number of unique terms in a document |
| $b_i$ | Number of characters in document $D_i$ |
| $\operatorname{avg}(b)$ | Average number of characters in a document |
| $G$ | Global collection statistics |
| $s$ | The slope in the context of pivoted document length normalization[3] |
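To make the symbols concrete, here is a minimal Python sketch that derives these statistics from a toy collection. The function name `collection_stats`, the toy documents, and the choice to count characters over tokens only (ignoring whitespace) are assumptions made for this example, not definitions taken from the SMART system.

```python
from collections import Counter


def collection_stats(docs):
    """Compute the notation's statistics for a list of tokenized documents."""
    N = len(docs)                                    # number of collection documents
    f = [Counter(d) for d in docs]                   # f[i][term] is f_ik
    u = [len(c) for c in f]                          # u_i: unique terms in D_i
    b = [sum(len(t) for t in d) for d in docs]       # b_i: characters in D_i (tokens only)
    return {
        "N": N,
        "f": f,
        "n": Counter(t for c in f for t in c),       # n_k: documents containing T_k
        "max_f": [max(c.values()) for c in f],       # max(f_ik) per document
        "avg_f": [sum(c.values()) / len(c) for c in f],  # avg(f_ik) per document
        "u": u,
        "avg_u": sum(u) / N,                         # avg(u)
        "b": b,
        "avg_b": sum(b) / N,                         # avg(b)
    }


# Toy collection: each document is a non-empty list of already-tokenized terms.
docs = [
    ["smart", "retrieval", "system", "retrieval"],
    ["vector", "space", "model"],
    ["relevance", "feedback", "retrieval", "model", "model", "model"],
]
stats = collection_stats(docs)
print(stats["N"], stats["u"], stats["n"]["retrieval"], stats["max_f"])
```

The sketch assumes every document contains at least one term; an empty document would need special handling before the per-document maximum and average are taken.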
SMART term-weighting triple notation

Term frequency $\text{tf}(f_{i_k})$

| Letter(s) | Formula | Description |
| --- | --- | --- |
| b | $1$ | Binary weight |
| t, n | $f_{i_k}$ | Raw term frequency |
| a | $0.5 + 0.5 \frac{f_{i_k}}{\max(f_{i_k})}$ | Augmented normalized term frequency |
| l | $1 + \log_2 f_{i_k}$ | Logarithm |
| L | $\frac{1 + \log_2(f_{i_k})}{1 + \log_2(\operatorname{avg}(f_{i_k}))}$ | Average-term-frequency-based normalization[3] |
| d | $1 + \log_2(1 + \log_2(f_{i_k}))$ | Double logarithm |

Document frequency $\text{df}(N, n_k)$

| Letter(s) | Formula | Description |
| --- | --- | --- |
| x, n | $1$ | Disregards the collection frequency |
| f | $\log_2\left(\frac{N}{n_k}\right)$ | Inverse collection frequency |
| t | $\log_2\left(\frac{N+1}{n_k}\right)$ | Inverse collection frequency |
| p | $\log_2\left(\frac{N - n_k}{n_k}\right)$ | Probabilistic inverse collection frequency |

Document length normalization $g(G, D_i)$

| Letter(s) | Formula | Description |
| --- | --- | --- |
| x, n | $1$ | No document length normalization |
| c | $\sqrt{\sum_{k=1}^{t} w_{i_k}^2}$ | Cosine normalization |
| u | $1 - s + s \frac{u_i}{\operatorname{avg}(u)}$ | Pivoted unique normalization[3] |
| b | $1 - s + s \frac{b_i}{\operatorname{avg}(b)}$ | Pivoted character length normalization[3] |

The gray letters in the first, fifth, and ninth columns are the scheme used by Salton and Buckley in their 1988 paper.[4] The bold letters in the second, sixth, and tenth columns are the scheme used in experiments reported thereafter. Where a row lists two letters, the gray letter comes first and the bold letter second.
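The formulas in the notation table can be sketched in Python as lookup tables keyed by letter. This is an illustrative sketch under stated assumptions, not the SMART implementation: it assumes a term's weight is the term frequency component times the document frequency component, divided by the normalization factor; where a row lists two letters it uses the second one; it implements only the normalizations n and c; and the names `TF`, `DF`, and `weigh` are hypothetical.

```python
import math

# Term frequency components tf(f_ik). f is f_ik (must be > 0);
# max_f and avg_f are the per-document max(f_ik) and avg(f_ik).
TF = {
    "b": lambda f, max_f, avg_f: 1.0,
    "n": lambda f, max_f, avg_f: float(f),
    "a": lambda f, max_f, avg_f: 0.5 + 0.5 * f / max_f,
    "l": lambda f, max_f, avg_f: 1 + math.log2(f),
    "L": lambda f, max_f, avg_f: (1 + math.log2(f)) / (1 + math.log2(avg_f)),
    "d": lambda f, max_f, avg_f: 1 + math.log2(1 + math.log2(f)),
}

# Document frequency components df(N, n_k); n_k must be > 0.
DF = {
    "n": lambda N, n_k: 1.0,
    "f": lambda N, n_k: math.log2(N / n_k),
    "t": lambda N, n_k: math.log2((N + 1) / n_k),
    "p": lambda N, n_k: math.log2((N - n_k) / n_k),  # undefined when n_k == N
}


def weigh(freqs, code, N, n):
    """Weight one document's terms. freqs maps term -> f_ik, n maps term -> n_k."""
    tf_code, df_code, norm_code = code
    max_f = max(freqs.values())
    avg_f = sum(freqs.values()) / len(freqs)
    w = {
        term: TF[tf_code](f, max_f, avg_f) * DF[df_code](N, n[term])
        for term, f in freqs.items()
    }
    if norm_code == "n":          # no document length normalization
        return w
    if norm_code == "c":          # cosine: divide by the Euclidean length
        length = math.sqrt(sum(x * x for x in w.values()))
        return w if length == 0 else {t: x / length for t, x in w.items()}
    raise ValueError(f"normalization {norm_code!r} is not covered by this sketch")


# ltc.lnn on toy numbers: 3 documents, document frequencies given in n.
N = 3
n = {"retrieval": 2, "model": 2, "feedback": 1}
doc = weigh({"retrieval": 1, "model": 3, "feedback": 1}, "ltc", N, n)
query = weigh({"retrieval": 1, "feedback": 1}, "lnn", N, n)
score = sum(weight * doc.get(term, 0.0) for term, weight in query.items())
print(doc, query, score)
```

The final line scores the query against the document with an inner product of the two weighted vectors, which is one common way to compare vectors in the vector space model. The pivoted normalizations u and b are omitted because they additionally require the slope $s$ and the collection averages.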

References

  1. Salton, G., & Lesk, M. E. (June 1965). "The SMART automatic document retrieval systems—an illustration". Communications of the ACM. 8 (6): 391–398. doi:10.1145/364955.364990.
  2. Palchowdhury, Sauparna (2016). "On The Provenance of tf-idf". sauparna.sdf.org. Retrieved 2019-07-29.
  3. Singhal, A., Buckley, C., & Mitra, M. (1996). Pivoted Document Length Normalization. SIGIR Forum, 51, 176-184.
  4. Salton, G., & Buckley, C. (1988). Term-Weighting Approaches in Automatic Text Retrieval. Inf. Process. Manage., 24, 513-523.

