n-gram

From Wikipedia, the free encyclopedia
Item sequences in computational linguistics
For other uses, see N-gram (disambiguation).
Not to be confused with word n-gram language model or Engram.

[Figure: Six n-grams frequently found in titles of publications about Coronavirus disease 2019 (COVID-19), as of 7 May 2020]

An n-gram is a sequence of n adjacent symbols in a particular order.[1] The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or (rarely) whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a text corpus or speech corpus.

N-gram is the parent term of a family of names: depending on the numeral n, its members are the 1-gram, 2-gram, etc., or the same sequences named with spoken numeral prefixes.

If Latin numerical prefixes are used, then an n-gram of size 1 is called a "unigram", size 2 a "bigram" (or, less commonly, a "digram"), and so on. If English cardinal numbers are used instead of the Latin prefixes, they are called "four-gram", "five-gram", etc. Similarly, Greek numerical prefixes such as "monomer", "dimer", "trimer", "tetramer", "pentamer", etc., or English cardinal numbers, "one-mer", "two-mer", "three-mer", etc., are used in computational biology for polymers or oligomers of a known size, called k-mers. When the items are words, n-grams may also be called shingles.[2]

In the context of natural language processing (NLP), the use of n-grams lets bag-of-words models capture information such as word order, which is not available in the traditional bag-of-words setting.
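As a concrete illustration (a minimal sketch, not drawn from the article's sources), extracting n-grams from any sequence of symbols is a single sliding-window pass; the same function works for characters, words, amino acids, or base pairs:

```python
from typing import Sequence


def ngrams(seq: Sequence, n: int) -> list[tuple]:
    """Return every window of n adjacent symbols of seq, in order."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


# Character bigrams of a short string:
print(ngrams("to_be", 2))  # [('t', 'o'), ('o', '_'), ('_', 'b'), ('b', 'e')]

# Word bigrams: pass a list of words instead of a string.
print(ngrams("to be or not to be".split(), 2))
```

Because the function is agnostic about the symbol type, the choice between character-level and word-level n-grams is made entirely by how the input is tokenized before the call.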

Examples

Shannon (1951)[3] discussed n-gram models of English. For example:

  • 3-gram character model (random draw based on the probabilities of each trigram): in no ist lat whey cratict froure birs grocid pondenome of demonstures of the retagin is regiactiona of cre
  • 2-gram word model (random draw of words taking into account their transition probabilities): the head and in frontal attack on an english writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected
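Shannon's "random draw based on the probabilities of each trigram" can be sketched as a character-level Markov generator: count how often each character follows each two-character context in a corpus, then repeatedly sample the next character in proportion to those counts. The corpus string below is just the article's sample output reused as toy training data:

```python
import random
from collections import Counter, defaultdict


def train_char_trigrams(text: str) -> dict:
    """For each 2-character context, count how often each next character follows."""
    model = defaultdict(Counter)
    for i in range(len(text) - 2):
        model[text[i:i + 2]][text[i + 2]] += 1
    return model


def generate(model: dict, seed: str, length: int, rng=random.Random(0)) -> str:
    """Extend seed by sampling each next character in proportion to its trigram count."""
    out = list(seed)
    for _ in range(length):
        counts = model["".join(out[-2:])]
        if not counts:  # context never seen in training text
            break
        chars, weights = zip(*counts.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)


corpus = "in no ist lat whey cratict froure birs grocid pondenome"
model = train_char_trigrams(corpus)
print(generate(model, "in", 30))
```

By construction every trigram of the output also occurs in the training text, which is why such samples look locally plausible while remaining globally meaningless, exactly as in Shannon's examples.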
Figure 1. n-gram examples from various disciplines. For each field, the sample sequence is shown with its 1-gram (unigram), 2-gram (bigram), and 3-gram (trigram) sequences; the orders of the resulting Markov models are 0, 1, and 2, respectively.

  • Protein sequencing (unit: amino acid), sample ... Cys-Gly-Leu-Ser-Trp ...
      1-grams: ..., Cys, Gly, Leu, Ser, Trp, ...
      2-grams: ..., Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp, ...
      3-grams: ..., Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp, ...
  • DNA sequencing (unit: base pair), sample ...AGCTTCGA...
      1-grams: ..., A, G, C, T, T, C, G, A, ...
      2-grams: ..., AG, GC, CT, TT, TC, CG, GA, ...
      3-grams: ..., AGC, GCT, CTT, TTC, TCG, CGA, ...
  • Language model (unit: character), sample ...to_be_or_not_to_be...
      1-grams: ..., t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, ...
      2-grams: ..., to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, ...
      3-grams: ..., to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, ...
  • Word n-gram language model (unit: word), sample ... to be or not to be ...
      1-grams: ..., to, be, or, not, to, be, ...
      2-grams: ..., to be, be or, or not, not to, to be, ...
      3-grams: ..., to be or, be or not, or not to, not to be, ...

Figure 1 shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences.

Here are further examples; these are word-level 3-grams and 4-grams (with counts of the number of times they appeared) from the Google n-gram corpus.[4]

3-grams

  • ceramics collectables collectibles (55)
  • ceramics collectables fine (130)
  • ceramics collected by (52)
  • ceramics collectible pottery (50)
  • ceramics collectibles cooking (45)

4-grams

  • serve as the incoming (92)
  • serve as the incubator (99)
  • serve as the independent (794)
  • serve as the index (223)
  • serve as the indication (72)
  • serve as the indicator (120)
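Counts like those above are produced by tallying every word-level n-gram across a corpus. A minimal sketch with Python's standard library (the toy text and numbers here are illustrative, not the Google corpus):

```python
from collections import Counter


def word_ngram_counts(text: str, n: int) -> Counter:
    """Count each word-level n-gram occurring in the text."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


counts = word_ngram_counts("to be or not to be", 2)
print(counts[("to", "be")])  # 2 -- this bigram occurs twice; every other one, once
```

At web scale the same tallying is distributed across machines (the Google corpus was built with MapReduce-style counting over trillions of words), but the per-document step is exactly this loop.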

References

  1. ^ "n-gram language model - an overview | ScienceDirect Topics". www.sciencedirect.com. Retrieved 12 December 2024.
  2. ^ Broder, Andrei Z.; Glassman, Steven C.; Manasse, Mark S.; Zweig, Geoffrey (1997). "Syntactic clustering of the web". Computer Networks and ISDN Systems. 29 (8): 1157–1166. doi:10.1016/s0169-7552(97)00031-7. S2CID 9022773.
  3. ^ Shannon, Claude E. (1951). "The redundancy of English". Cybernetics; Transactions of the 7th Conference. New York: Josiah Macy, Jr. Foundation.
  4. ^ Franz, Alex; Brants, Thorsten (2006). "All Our N-gram are Belong to You". Google Research Blog. Archived from the original on 17 October 2006. Retrieved 16 December 2011.
