w-shingling

From Wikipedia, the free encyclopedia

In natural language processing, a w-shingling is a set of unique shingles (that is, n-grams), each of which is composed of a contiguous subsequence of tokens within a document. The set can then be used to ascertain the similarity between documents. The symbol w denotes the number of tokens in each shingle, whether chosen in advance or solved for.

The document "a rose is a rose is a rose" can therefore be maximally tokenized as follows:

(a,rose,is,a,rose,is,a,rose)

The contiguous sequences of 4 tokens (thus n = 4, i.e. 4-grams), listed with repetition, are

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) }

Discarding the duplicates reduces this, in this particular instance, to the 4-shingling

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }.
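
The tokenize, window and deduplicate steps above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function name `w_shingling` is hypothetical, and splitting on whitespace is an assumed stand-in for a real tokenizer.

```python
def w_shingling(tokens, w):
    """Return the set of unique shingles of w contiguous tokens."""
    # Slide a window of width w across the token sequence. Collecting the
    # windows in a set discards repeated shingles automatically.
    # If there are fewer than w tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = "a rose is a rose is a rose".split()
shingles = w_shingling(tokens, 4)

print(len(tokens) - 4 + 1)  # 5 windows before deduplication
print(len(shingles))        # 3 unique shingles
```

Tuples are used for the shingles because they are hashable and can therefore be stored in a set.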

Resemblance


For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or

$$r(A,B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$$

where S(A) denotes the shingling of document A and |X| is the size of set X. The resemblance is a number in the range [0,1], where 1 indicates that the two documents have identical shinglings. This definition is identical to the Jaccard coefficient describing similarity and diversity of sample sets.
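
As a hedged illustration of the ratio above, the following Python sketch computes the resemblance of two token sequences. The names `w_shingling` and `resemblance`, the whitespace tokenization, the second example sentence, and the convention of returning 1.0 when both shinglings are empty are all assumptions made for this example; none of them is part of the definition.

```python
def w_shingling(tokens, w):
    """Return the set of unique shingles of w contiguous tokens."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(tokens_a, tokens_b, w):
    """Return |S(A) & S(B)| / |S(A) | S(B)| for shingle size w."""
    s_a = w_shingling(tokens_a, w)
    s_b = w_shingling(tokens_b, w)
    union = s_a | s_b
    if not union:
        # Assumption: two documents too short to yield any shingle
        # are treated as fully resembling each other.
        return 1.0
    return len(s_a & s_b) / len(union)


# Assumption: whitespace splitting stands in for a real tokenizer.
a = "a rose is a rose is a rose".split()
b = "a rose is a flower which is a rose".split()

# S(a) has 3 shingles, S(b) has 6, and they share only (a, rose, is, a),
# so the resemblance is 1 / 8.
print(resemblance(a, b, 4))  # 0.125
```

Because only the set of shingles is compared, a longer text such as "a rose is a rose is a rose is a rose" yields the same 4-shingling as `a`, so their resemblance under this sketch is 1.0 even though the texts differ.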

Retrieved from "https://en.wikipedia.org/w/index.php?title=W-shingling&oldid=1293903019"