A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n = 2.
The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including computational linguistics, cryptography, and speech recognition.
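As a minimal sketch of the idea, the following Python snippet extracts the bigrams of a token sequence and computes their relative frequencies. The function names `bigrams` and `bigram_frequencies` are illustrative choices, not part of any standard library, and the input is assumed to be a list of tokens (letters here, but words work the same way).

```python
from collections import Counter

def bigrams(tokens):
    """Return the list of adjacent pairs in a sequence of tokens."""
    return list(zip(tokens, tokens[1:]))

def bigram_frequencies(tokens):
    """Map each bigram to its share of all bigrams in the sequence."""
    counts = Counter(bigrams(tokens))
    total = sum(counts.values())
    if total == 0:
        return {}  # fewer than two tokens: no bigrams to count
    return {pair: n / total for pair, n in counts.items()}

letters = list("banana")
print(bigrams(letters))
# [('b', 'a'), ('a', 'n'), ('n', 'a'), ('a', 'n'), ('n', 'a')]
print(bigram_frequencies(letters))
# {('b', 'a'): 0.2, ('a', 'n'): 0.4, ('n', 'a'): 0.4}
```

Passing a list of words instead of letters, for example `"to be or not to be".split()`, yields word bigrams with no change to the code.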
Gappy bigrams or skipping bigrams are word pairs which allow gaps (perhaps avoiding connecting words, or allowing some simulation of dependencies, as in a dependency grammar).
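One way to generate such pairs is sketched below, under the assumption that a gappy bigram is an ordered pair of words with at most `max_gap` words between them. Both the function name `skip_bigrams` and the `max_gap` parameter are hypothetical; other definitions of the allowed gap are possible.

```python
def skip_bigrams(tokens, max_gap=1):
    """Return ordered pairs of tokens with at most max_gap tokens between them."""
    pairs = []
    for i, first in enumerate(tokens):
        # The partner may sit 1 to max_gap + 1 positions to the right.
        last = min(i + 2 + max_gap, len(tokens))
        for j in range(i + 1, last):
            pairs.append((first, tokens[j]))
    return pairs

words = "the rain in Spain".split()
print(skip_bigrams(words, max_gap=1))
# [('the', 'rain'), ('the', 'in'), ('rain', 'in'), ('rain', 'Spain'), ('in', 'Spain')]
```

With `max_gap=0` the function returns only the ordinary adjacent bigrams.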
Bigrams, along with other n-grams, are used in most successful language models for speech recognition.[1]
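To illustrate how bigram counts can feed a language model, the sketch below estimates the conditional probability of a word given the preceding word as count(w1, w2) / count(w1). This is an unsmoothed maximum-likelihood estimate chosen for brevity; it is an assumption of this example rather than the method of any particular speech recognizer, and practical systems typically add smoothing for unseen pairs. The name `bigram_model` is illustrative.

```python
from collections import Counter

def bigram_model(tokens):
    """Estimate P(second | first) for each observed bigram (no smoothing)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each token only where it appears as the first element of a pair.
    first_counts = Counter(tokens[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

tokens = "the cat sat on the mat".split()
model = bigram_model(tokens)
print(model[("the", "cat")])  # 0.5
print(model[("cat", "sat")])  # 1.0
```

Here "the" is followed once by "cat" and once by "mat", so each continuation receives probability 0.5.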
Bigram frequency attacks can be used in cryptography to solve cryptograms. See frequency analysis.
Bigram frequency is one approach to statistical language identification.
Some activities in logology or recreational linguistics involve bigrams. These include attempts to find English words beginning with every possible bigram,[2] or words containing a string of repeated bigrams, such as logogogue.[3]
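A search for repeated bigrams of this kind could be sketched as follows. The helper `longest_repeated_bigram` is a hypothetical name; it assumes that "repeated" means the same two-letter pair occurring back to back, and it reports the longest such run in a word.

```python
import re

def longest_repeated_bigram(word):
    """Return (bigram, repeats) for the longest back-to-back run, or None."""
    best = None
    # The lookahead lets candidate runs overlap, so every start position is tried.
    for match in re.finditer(r"(?=((..)\2+))", word):
        run, pair = match.group(1), match.group(2)
        repeats = len(run) // 2
        if best is None or repeats > best[1]:
            best = (pair, repeats)
    return best

print(longest_repeated_bigram("logogogue"))  # ('og', 3)
print(longest_repeated_bigram("bigram"))     # None
```

Running the function over a word list and keeping the results with the highest repeat counts would surface candidates such as the example above.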
The frequency of the most common letter bigrams in a large English corpus is:[4]
th  3.56%    of  1.17%    io  0.83%
he  3.07%    ed  1.17%    le  0.83%
in  2.43%    is  1.13%    ve  0.83%
er  2.05%    it  1.12%    co  0.79%
an  1.99%    al  1.09%    me  0.79%
re  1.85%    ar  1.07%    de  0.76%
on  1.76%    st  1.05%    hi  0.76%
at  1.49%    to  1.05%    ri  0.73%
en  1.45%    nt  1.04%    ro  0.73%
nd  1.35%    ng  0.95%    ic  0.70%
ti  1.34%    se  0.93%    ne  0.69%
es  1.34%    ha  0.93%    ea  0.69%
or  1.28%    as  0.87%    ra  0.69%
te  1.20%    ou  0.87%    ce  0.65%