Bigram

From Wikipedia, the free encyclopedia

Case of an n-gram, where n is 2

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2.
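As an illustration of this definition, the following minimal Python sketch pairs each element of a sequence with its successor. The function name `bigrams` is chosen here for illustration and does not come from any particular library; the same function works for letters (a string) and for words (a list of tokens).

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def bigrams(tokens: Sequence[T]) -> List[Tuple[T, T]]:
    """Return every pair of adjacent elements in `tokens`, in order."""
    # zip stops at the shorter input, so a sequence of length n
    # yields n - 1 pairs (and none when n < 2).
    return list(zip(tokens, tokens[1:]))


# Letter bigrams of a single word.
print(bigrams("text"))
# Word bigrams of a whitespace-tokenized phrase.
print(bigrams("the quick brown fox".split()))
```

A sequence of n tokens yields n - 1 bigrams, so an empty or single-token input produces none.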

The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, and speech recognition.
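A frequency distribution of this kind can be sketched in a few lines of Python. The sketch below makes two assumptions that are not stated in the article: only the letters a to z are counted, and bigrams are counted within words rather than across word boundaries. The function name `letter_bigram_frequencies` is illustrative.

```python
import re
from collections import Counter
from typing import Dict


def letter_bigram_frequencies(text: str) -> Dict[str, float]:
    """Map each letter bigram in `text` to its relative frequency."""
    counts: Counter = Counter()
    # Assumption: lowercase the text and count only within runs of a-z,
    # so no bigram spans a space or punctuation mark.
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}


freqs = letter_bigram_frequencies("the then")
for bigram, share in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{bigram} {share:.2%}")
```

Counting across word boundaries, or including spaces as tokens, would give a different distribution; which convention is appropriate depends on the application.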

Gappy bigrams or skipping bigrams are word pairs that allow gaps between the two words (perhaps to skip connecting words, or to approximate dependencies, as in a dependency grammar).
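One common way to make the idea of a gap concrete is to cap the number of skipped words. The sketch below assumes that formulation; the function name `skip_bigrams` and the parameter `max_skip` are illustrative and are not defined in the article.

```python
from typing import List, Sequence, Tuple


def skip_bigrams(tokens: Sequence[str], max_skip: int) -> List[Tuple[str, str]]:
    """Return ordered word pairs with at most `max_skip` words between them."""
    pairs = []
    for i, left in enumerate(tokens):
        # The right-hand word may sit 1 .. max_skip + 1 positions after `left`.
        last = min(i + max_skip + 2, len(tokens))
        for j in range(i + 1, last):
            pairs.append((left, tokens[j]))
    return pairs


words = "the cat sat down".split()
print(skip_bigrams(words, max_skip=0))  # ordinary adjacent bigrams
print(skip_bigrams(words, max_skip=1))  # also pairs that skip one word
```

With `max_skip=0` the function reduces to ordinary bigrams, which makes the relationship between the two notions explicit.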

Applications

Bigrams, along with other n-grams, are used in most successful language models for speech recognition.^[1]
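To illustrate how bigrams enter a language model, the following sketch estimates the conditional probability of each word given the previous word by maximum likelihood, that is, count(previous, word) divided by count(previous). It is a toy under stated assumptions, not the method of any cited system: there is no smoothing, sentences are split on whitespace, and the boundary markers `<s>` and `</s>` as well as the function name `train_bigram_model` are conventions chosen here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def train_bigram_model(sentences: Iterable[str]) -> Dict[Tuple[str, str], float]:
    """Estimate P(word | previous word) from whitespace-tokenized sentences."""
    pair_counts: Counter = Counter()
    context_counts: Counter = Counter()
    for sentence in sentences:
        # Assumed boundary markers so the first and last words get a context.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for previous, word in zip(tokens, tokens[1:]):
            pair_counts[(previous, word)] += 1
            context_counts[previous] += 1
    # Maximum-likelihood estimate; unseen pairs are simply absent.
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


model = train_bigram_model(["the cat sat", "the dog sat"])
print(model[("the", "cat")])
print(model.get(("cat", "dog"), 0.0))
```

Practical models add smoothing or back-off so that word pairs never seen in training do not receive zero probability.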

Bigram frequency attacks can be used in cryptography to solve cryptograms. See frequency analysis.
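A toy version of such an attack can be sketched for the simplest case, a Caesar shift cipher, which is an assumption made here for brevity; general substitution cryptograms need a search over many more keys. Each of the 26 candidate decryptions is scored by summing the English percentages of the bigrams it contains, using the fourteen most frequent bigrams from the table in this article, and the highest-scoring candidate is returned. All function names are illustrative.

```python
# Percentages of the fourteen most common English letter bigrams,
# taken from the frequency table in this article.
BIGRAM_PERCENT = {
    "th": 3.56, "he": 3.07, "in": 2.43, "er": 2.05, "an": 1.99,
    "re": 1.85, "on": 1.76, "at": 1.49, "en": 1.45, "nd": 1.35,
    "ti": 1.34, "es": 1.34, "or": 1.28, "te": 1.20,
}


def shift_text(text: str, shift: int) -> str:
    """Shift each lowercase letter by `shift` places, leaving other characters."""
    shifted = []
    for ch in text:
        if "a" <= ch <= "z":
            shifted.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            shifted.append(ch)
    return "".join(shifted)


def english_score(text: str) -> float:
    """Sum the table percentages of all within-word bigrams in `text`."""
    return sum(
        BIGRAM_PERCENT.get(word[i:i + 2], 0.0)
        for word in text.split()
        for i in range(len(word) - 1)
    )


def crack_caesar(ciphertext: str) -> str:
    """Return the candidate decryption whose bigrams look most like English."""
    candidates = (shift_text(ciphertext, -key) for key in range(26))
    return max(candidates, key=english_score)


secret = shift_text("the other thing is there", 3)
print(secret)
print(crack_caesar(secret))
```

Very short ciphertexts may contain too few common bigrams for the score to single out the right key, so the approach becomes more reliable as the message grows.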

Bigram frequency is one approach to statistical language identification.
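One way to sketch this approach is to build a letter-bigram profile for each language from sample text and assign an unknown text to the language with the most similar profile. The choice of cosine similarity, the tiny sample strings, and the function names below are all assumptions made for illustration; real systems train on large corpora and often use other distance measures.

```python
import math
import re
from collections import Counter
from typing import Dict


def bigram_profile(text: str) -> Counter:
    """Count letter bigrams within each word of `text` (case-insensitive)."""
    profile: Counter = Counter()
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        profile.update(word[i:i + 2] for i in range(len(word) - 1))
    return profile


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two bigram count vectors."""
    dot = sum(count * b[bigram] for bigram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def identify_language(text: str, profiles: Dict[str, Counter]) -> str:
    """Return the key of the profile most similar to `text`."""
    sample = bigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))


# Toy training samples, far too small for real use.
profiles = {
    "en": bigram_profile("the three thin things"),
    "de": bigram_profile("schnell schon schwach"),
}
print(identify_language("this thing", profiles))
print(identify_language("schon", profiles))
```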

Some activities in logology or recreational linguistics involve bigrams. These include attempts to find English words beginning with every possible bigram,^[2] or words containing a string of repeated bigrams, such as logogogue.^[3]

Bigram frequency in the English language

The frequency of the most common letter bigrams in a large English corpus is:^[4]

th 3.56%       of 1.17%       io 0.83%
he 3.07%       ed 1.17%       le 0.83%
in 2.43%       is 1.13%       ve 0.83%
er 2.05%       it 1.12%       co 0.79%
an 1.99%       al 1.09%       me 0.79%
re 1.85%       ar 1.07%       de 0.76%
on 1.76%       st 1.05%       hi 0.76%
at 1.49%       to 1.05%       ri 0.73%
en 1.45%       nt 1.04%       ro 0.73%
nd 1.35%       ng 0.95%       ic 0.70%
ti 1.34%       se 0.93%       ne 0.69%
es 1.34%       ha 0.93%       ea 0.69%
or 1.28%       as 0.87%       ra 0.69%
te 1.20%       ou 0.87%       ce 0.65%

References

[1] Collins, Michael John (1996-06-24). "A new statistical parser based on bigram lexical dependencies". Proceedings of the 34th annual meeting on Association for Computational Linguistics. Association for Computational Linguistics. pp. 184–191. arXiv:cmp-lg/9605012. doi:10.3115/981863.981888. S2CID 12615602. Retrieved 2018-10-09.
[2] Cohen, Philip M. (1975). "Initial Bigrams". Word Ways. 8 (2). Retrieved 2016-09-11.
[3] Corbin, Kyle (1989). "Double, Triple, and Quadruple Bigrams". Word Ways. 22 (3). Retrieved 2016-09-11.
[4] "English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU". norvig.com. Retrieved 2019-10-28.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Bigram&oldid=1259917234"

Categories:

Hidden categories:

[8]ページ先頭

Movatterモバイル変換

Applications

Bigram frequency in the English language

See also

References