| Bidirectional encoder representations from transformers (BERT) | |
|---|---|
| Original author | Google AI | 
| Initial release | October 31, 2018 | 
| Repository | GitHub | 
| Type | |
| License | Apache 2.0 | 
| Website | arXiv | 
Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google.[1][2] It learns to represent text as a sequence of vectors using self-supervised learning. It uses the encoder-only transformer architecture. BERT dramatically improved the state of the art for large language models. As of 2020, BERT is a ubiquitous baseline in natural language processing (NLP) experiments.[3]
BERT is trained by masked token prediction and next sentence prediction. With this training, BERT learns contextual, latent representations of tokens in their context, similar to ELMo and GPT-2.[4] It found applications for many natural language processing tasks, such as coreference resolution and polysemy resolution.[5] It improved on ELMo and spawned the study of "BERTology", which attempts to interpret what is learned by BERT.[3]
BERT was originally implemented in the English language at two model sizes, BERTBASE (110 million parameters) and BERTLARGE (340 million parameters). Both were trained on the Toronto BookCorpus[6] (800M words) and English Wikipedia (2,500M words).[1]: 5 The weights were released on GitHub.[7] On March 11, 2020, 24 smaller models were released, the smallest being BERTTINY with just 4 million parameters.[7]

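BERT is an "encoder-only" transformer architecture. At a high level, BERT consists of 4 modules:
- Tokenizer: converts a piece of English text into a sequence of integers ("tokens").
- Embedding: converts the sequence of tokens into an array of real-valued vectors representing the tokens.
- Encoder: a stack of Transformer blocks with self-attention, but without causal masking.
- Task head: converts the final representation vectors back into a predicted probability distribution over the token types.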
The task head is necessary for pre-training, but it is often unnecessary for so-called "downstream tasks," such as question answering or sentiment classification. Instead, one removes the task head, replaces it with a newly initialized module suited for the task, and fine-tunes the new module. The latent vector representation of the model is directly fed into this new module, allowing for sample-efficient transfer learning.[1][8]

This section describes the embedding used by BERTBASE. The other one, BERTLARGE, is similar, just larger.
The tokenizer of BERT is WordPiece, which is a sub-word strategy like byte-pair encoding. Its vocabulary size is 30,000, and any token not appearing in its vocabulary is replaced by [UNK] ("unknown").
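As a rough illustration of sub-word tokenization, the following Python sketch splits a single word by greedy longest-match-first lookup, which is how WordPiece is commonly applied at inference time. The toy vocabulary, the function name, and the "##" prefix marking continuation pieces are assumptions made for this example and are not details given above.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real one has about 30,000 entries.
TOY_VOCAB = {"play", "##ing", "##ed", "un", "##play", "##able", "dog"}

print(wordpiece_tokenize("playing", TOY_VOCAB))
print(wordpiece_tokenize("unplayable", TOY_VOCAB))
print(wordpiece_tokenize("cat", TOY_VOCAB))
```

In this sketch a word with no matching pieces collapses to a single [UNK], mirroring the replacement rule described above.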

The first layer is the embedding layer, which contains three components: token type embeddings, position embeddings, and segment type embeddings. Segment type embeddings mark which text segment a token belongs to: type-1 tokens are all tokens that appear after the [SEP] special token, and all prior tokens are type-0.
The three embedding vectors are added together, representing the initial token representation as a function of these three pieces of information. After embedding, the vector representation is normalized using a LayerNorm operation, outputting a 768-dimensional vector for each input token. After this, the representation vectors are passed forward through 12 Transformer encoder blocks, and are decoded back to 30,000-dimensional vocabulary space using a basic affine transformation layer.
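The embedding step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the tables are randomly initialized stand-ins for learned weights, the dimension is 8 rather than 768, the token ids are made up, and the LayerNorm omits the learned scale and shift that a trained model would include.

```python
import math
import random


def layer_norm(vec, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def make_table(rows, dim, rng):
    """Randomly initialized embedding table standing in for learned weights."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, segment_ids, tok_table, pos_table, seg_table):
    """Sum token, position and segment embeddings, then apply LayerNorm."""
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        summed = [t + p + s for t, p, s in
                  zip(tok_table[tok], pos_table[pos], seg_table[seg])]
        out.append(layer_norm(summed))
    return out


rng = random.Random(0)
DIM = 8                                # toy size; the text above uses 768
tok_table = make_table(20, DIM, rng)   # toy vocabulary of 20 token ids
pos_table = make_table(16, DIM, rng)   # up to 16 positions
seg_table = make_table(2, DIM, rng)    # segment type 0 or 1

# Hypothetical ids: id 2 plays the role of [SEP]; tokens after the first
# [SEP] are segment type 1, everything before (and including) it is type 0.
token_ids = [1, 5, 6, 2, 7, 8, 2]
segment_ids = [0, 0, 0, 0, 1, 1, 1]
vectors = embed(token_ids, segment_ids, tok_table, pos_table, seg_table)
print(len(vectors), len(vectors[0]))
```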
The encoder stack of BERT has 2 free parameters: $L$, the number of layers, and $H$, the hidden size. There are always $H/64$ self-attention heads, and the feed-forward/filter size is always $4H$. By varying these two numbers, one obtains an entire family of BERT models.[9]
For BERT, the notation for the encoder stack is written as L/H. For example, BERTBASE is written as 12L/768H, BERTLARGE as 24L/1024H, and BERTTINY as 2L/128H.
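To make the L/H notation concrete, the sketch below derives the dependent sizes from the number of layers L and the hidden size H, assuming H/64 attention heads and a feed-forward size of 4H. It also gives a rough parameter estimate. The function names, the 512-position limit, the pooler term, and the rounded 30,000-token vocabulary are assumptions of this example, and the count ignores the pre-training task heads.

```python
def encoder_config(num_layers, hidden_size):
    """Derive dependent sizes from the two free parameters L and H."""
    return {
        "layers": num_layers,
        "hidden": hidden_size,
        "heads": hidden_size // 64,       # H/64 attention heads
        "feed_forward": 4 * hidden_size,  # feed-forward size 4H
    }


def approx_param_count(num_layers, hidden_size,
                       vocab_size=30_000, max_positions=512, segment_types=2):
    """Rough parameter count: embeddings, encoder blocks and a pooler layer."""
    h, ff = hidden_size, 4 * hidden_size
    # Three embedding tables plus the LayerNorm scale and shift.
    embeddings = (vocab_size + max_positions + segment_types) * h + 2 * h
    # Query, key, value and output projections (with biases) plus LayerNorm.
    attention = 4 * (h * h + h) + 2 * h
    # Two dense layers (with biases) plus LayerNorm.
    feed_forward = (h * ff + ff) + (ff * h + h) + 2 * h
    pooler = h * h + h
    return embeddings + num_layers * (attention + feed_forward) + pooler


for name, (layers, hidden) in {"12L/768H": (12, 768),
                               "24L/1024H": (24, 1024),
                               "2L/128H": (2, 128)}.items():
    cfg = encoder_config(layers, hidden)
    millions = approx_param_count(layers, hidden) / 1e6
    print(name, cfg["heads"], "heads,", cfg["feed_forward"], "ff,",
          f"{millions:.1f}M parameters (approx.)")
```

Under these assumptions the estimates land close to, but not exactly on, the parameter counts quoted for the released models.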
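BERT was pre-trained simultaneously on two tasks:[10] masked language modeling and next sentence prediction.
In masked language modeling, some input tokens are hidden and the model predicts them. For example, in the sentence "The cat sat on the [MASK]," BERT would need to predict "mat." This helps BERT learn bidirectional context, meaning it understands the relationships between words not just from left to right or right to left but from both directions at the same time.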
In masked language modeling, 15% of tokens were randomly selected for the masked-prediction task, and the training objective was to predict the original token given its context. In more detail, the selected token is:
- replaced with a [MASK] token with probability 80%,
- replaced with a random token with probability 10%,
- left unchanged with probability 10%.
The reason not all selected tokens are masked is to avoid the dataset shift problem. The dataset shift problem arises when the distribution of inputs seen during training differs significantly from the distribution encountered during inference. A trained BERT model might be applied to word representation (like Word2Vec), where it would be run over sentences not containing any [MASK] tokens. It was later found that more diverse training objectives are generally better.[11]
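As an illustrative example, consider the sentence "my dog is cute". It would first be divided into tokens like "my₁ dog₂ is₃ cute₄". Then a random token in the sentence would be picked. Let it be the 4th one, "cute₄". Next, there would be three possibilities:
- with probability 80%, the chosen token is masked, resulting in "my₁ dog₂ is₃ [MASK]₄";
- with probability 10%, the chosen token is replaced by a random token, such as "happy", resulting in "my₁ dog₂ is₃ happy₄";
- with probability 10%, nothing is done, resulting in "my₁ dog₂ is₃ cute₄".
After processing the input text, the model's 4th output vector is passed to its decoder layer, which outputs a probability distribution over its 30,000-dimensional vocabulary space.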
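The selection and replacement procedure described above can be sketched as follows. This is a simplified illustration, not the original data pipeline: each token is selected independently with probability 15% (a real pipeline might instead pick a fixed number of positions and skip special tokens), and the function name, toy vocabulary, and return format are assumptions of the example.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15, mask_token="[MASK]"):
    """Apply 15% selection and the 80/10/10 replacement rule to a token list.

    Returns the corrupted tokens and a parallel list of targets, where a
    target is the original token at selected positions and None elsewhere.
    """
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            corrupted.append(tok)    # not selected: left alone, no prediction
            targets.append(None)
            continue
        targets.append(tok)          # selected: the model must predict tok
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(mask_token)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random vocabulary token
        else:
            corrupted.append(tok)                # 10%: keep unchanged
    return corrupted, targets


rng = random.Random(0)
toy_vocab = ["my", "dog", "is", "cute", "happy", "cat"]  # hypothetical vocabulary
corrupted, targets = mask_tokens(["my", "dog", "is", "cute"], toy_vocab, rng)
print(corrupted, targets)
```

Only the positions with a non-None target contribute to the training loss, including those where the token was left unchanged.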

Given two sentences, the model predicts if they appear sequentially in the training corpus, outputting either [IsNext] or [NotNext]. During training, the algorithm sometimes samples two sentences from a single continuous span in the training corpus, while at other times, it samples two sentences from two discontinuous spans.
The first sentence starts with a special token, [CLS] (for "classify"). The two sentences are separated by another special token, [SEP] (for "separate"). After processing the two sentences, the final vector for the [CLS] token is passed to a linear layer for binary classification into [IsNext] and [NotNext].
For example:
- Given "[CLS] my dog is cute [SEP] he likes playing [SEP]", the model should predict [IsNext].
- Given "[CLS] my dog is cute [SEP] how do magnets work [SEP]", the model should predict [NotNext].
BERT is meant as a general pretrained model for various applications in natural language processing. That is, after pre-training, BERT can be fine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks such as natural language inference and text classification, and sequence-to-sequence-based language generation tasks such as question answering and conversational response generation.[12]
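The packed two-sentence input used for next sentence prediction, which is also the format used when fine-tuning on sentence-pair tasks, can be sketched as below. The function names, whitespace tokenization, and the 50% default for choosing the true next sentence are assumptions of this example; the text above only says that consecutive and non-consecutive pairs are both sampled.

```python
import random


def pack_pair(tokens_a, tokens_b):
    """Build the [CLS] A [SEP] B [SEP] input and its segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Everything up to and including the first [SEP] is segment 0.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


def make_nsp_example(sentences, index, rng, next_prob=0.5):
    """Pair sentence `index` with its true successor or with another sentence.

    `index` must not point at the last sentence, and the list needs at
    least three sentences so that a non-consecutive partner exists.
    """
    first = sentences[index]
    if rng.random() < next_prob:
        second, label = sentences[index + 1], "IsNext"
    else:
        # Any sentence other than the first one and its true successor.
        candidates = [i for i in range(len(sentences))
                      if i not in (index, index + 1)]
        second, label = sentences[rng.choice(candidates)], "NotNext"
    tokens, segment_ids = pack_pair(first.split(), second.split())
    return tokens, segment_ids, label


corpus = ["my dog is cute", "he likes playing", "how do magnets work"]
print(make_nsp_example(corpus, 0, random.Random(0)))
```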
The original BERT paper published results demonstrating that a small amount of fine-tuning (for BERTLARGE, 1 hour on 1 Cloud TPU) allowed it to achieve state-of-the-art performance on a number of natural language understanding tasks, including the GLUE task set, SQuAD v1.1 and v2.0, and SWAG.[1]
In the original paper, all parameters of BERT are fine-tuned. The paper recommends that, for downstream text classification applications, the output vector at the [CLS] input token be fed into a linear-softmax layer to produce the label outputs.[1]
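A linear-softmax layer over the [CLS] output can be sketched as below. The 4-dimensional vector, the weights, and the two-label setup are made-up illustrative values; in practice the vector would have the model's hidden size and the weights would be learned during fine-tuning.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_from_cls(cls_vector, weights, bias):
    """Linear layer followed by softmax, applied to the [CLS] output vector.

    `weights` has one row per label; each row has len(cls_vector) entries.
    """
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Hypothetical [CLS] vector and a 2-label classification head.
cls_vector = [0.5, -1.0, 0.25, 2.0]
weights = [[0.1, 0.0, 0.0, 0.3],    # label 0
           [0.0, 0.2, 0.1, -0.3]]   # label 1
bias = [0.0, 0.0]
probs = classify_from_cls(cls_vector, weights, bias)
print(probs)
```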
The original code base defined the final linear layer as a "pooler layer", in analogy with global pooling in computer vision, even though it simply discards all output tokens except the one corresponding to [CLS].[15]
BERT was trained on the BookCorpus (800M words) and a filtered version of English Wikipedia (2,500M words) without lists, tables, and headers.
Training BERTBASE on 4 Cloud TPUs (16 TPU chips total) took 4 days, at an estimated cost of 500 USD.[7] Training BERTLARGE on 16 Cloud TPUs (64 TPU chips total) took 4 days.[1]
Language models like ELMo, GPT-2, and BERT spawned the study of "BERTology", which attempts to interpret what is learned by these models. Their performance on these natural language understanding tasks is not yet well understood.[3][16][17] Several research publications in 2018 and 2019 investigated how BERT's output depends on carefully chosen input sequences,[18][19] analyzed internal vector representations through probing classifiers,[20][21] and examined the relationships represented by attention weights.[16][17]
The high performance of the BERT model could also be attributed to the fact that it is bidirectionally trained.[22] This means that BERT, based on the Transformer model architecture, applies its self-attention mechanism to learn information from a text from the left and right side during training, and consequently gains a deep understanding of the context. For example, the word "fine" can have two different meanings depending on the context ("I feel fine today", "She has fine blond hair"). BERT considers the words surrounding the target word "fine" from the left and right side.
However, it comes at a cost: because the encoder-only architecture lacks a decoder, BERT cannot be prompted and cannot generate text, while bidirectional models in general do not work effectively without the right side, thus being difficult to prompt. As an illustrative example, if one wishes to use BERT to continue a sentence fragment "Today, I went to", then naively one would mask out all the tokens as "Today, I went to [MASK] [MASK] [MASK] ... [MASK]." where the number of [MASK] tokens is the length of the sentence one wishes to extend to. However, this constitutes a dataset shift, as during training, BERT has never seen sentences with that many tokens masked out. Consequently, its performance degrades. More sophisticated techniques allow text generation, but at a high computational cost.[23]
BERT was originally published by Google researchers Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. The design has its origins from pre-training contextual representations, including semi-supervised sequence learning,[24] generative pre-training, ELMo,[25] and ULMFit.[26] Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary, whereas BERT takes into account the context for each occurrence of a given word. For instance, whereas the vector for "running" will have the same word2vec vector representation for both of its occurrences in the sentences "He is running a company" and "He is running a marathon", BERT will provide a contextualized embedding that will be different according to the sentence.[4]
On October 25, 2019, Google announced that they had started applying BERT models to English-language search queries on Google Search within the US.[27] On December 9, 2019, it was reported that BERT had been adopted by Google Search for over 70 languages.[28][29] In October 2020, almost every single English-based query was processed by a BERT model.[30]
The BERT models were influential and inspired many variants.
RoBERTa (2019)[31] was an engineering improvement. It preserves BERT's architecture (slightly larger, at 355M parameters), but improves its training, changing key hyperparameters, removing the next-sentence prediction task, and using much larger mini-batch sizes.
XLM-RoBERTa (2019)[32] was a multilingual RoBERTa model. It was one of the first works on multilingual language modeling at scale.
DistilBERT (2019) distills BERTBASE to a model with just 60% of its parameters (66M), while preserving 95% of its benchmark scores.[33][34] Similarly, TinyBERT (2019)[35] is a distilled model with just 28% of its parameters.
ALBERT (2019)[36] shared parameters across layers, and experimented with independently varying the hidden size and the word-embedding layer's output size as two hyperparameters. It also replaced the next sentence prediction task with the sentence-order prediction (SOP) task, where the model must distinguish the correct order of two consecutive text segments from their reversed order.
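A sentence-order prediction example can be built from two consecutive segments as in the sketch below. The function name, the 50% swap probability, and the label convention (1 for original order, 0 for reversed) are assumptions made for illustration.

```python
import random


def make_sop_example(segment_a, segment_b, rng, swap_prob=0.5):
    """Return two consecutive segments in original or reversed order.

    Label 1 means the original order was kept, label 0 means it was swapped.
    """
    if rng.random() < swap_prob:
        return (segment_b, segment_a), 0
    return (segment_a, segment_b), 1


pair, label = make_sop_example("my dog is cute", "he likes playing",
                               random.Random(0))
print(pair, label)
```

Unlike next sentence prediction, both the positive and the negative example use the same two segments, so the model cannot solve the task from topic differences alone.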
ELECTRA (2020)[37] applied the idea of generative adversarial networks to the MLM task. Instead of masking out tokens, a small language model generates random plausible substitutions, and a larger network identifies these replaced tokens. The small model aims to fool the large model.
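The training data for the larger network can be sketched as a per-token binary labeling problem. In the sketch below, `propose` is a hypothetical stand-in for the small generator model, and the 15% selection rate and the rule that a proposal identical to the original counts as "not replaced" are assumptions of this example.

```python
import random


def corrupt_with_generator(tokens, propose, rng, select_prob=0.15):
    """Replace some tokens with generator proposals and label each position.

    `propose(tokens, i)` stands in for the small generator network.
    Returns the corrupted tokens and 0/1 labels (1 = token was replaced).
    """
    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        new_tok = propose(tokens, i) if rng.random() < select_prob else tok
        corrupted.append(new_tok)
        # A proposal that reproduces the original counts as not replaced.
        labels.append(int(new_tok != tok))
    return corrupted, labels


def toy_generator(tokens, i):
    """Hypothetical generator that always proposes the word 'the'."""
    return "the"


sentence = ["the", "chef", "cooked", "the", "meal"]
print(corrupt_with_generator(sentence, toy_generator, random.Random(0)))
```

The larger network is then trained to predict the label at every position, not only at the selected ones.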
DeBERTa (2020)[38] is a significant architectural variant, with disentangled attention. Its key idea is to treat the positional and token encodings separately throughout the attention mechanism. Instead of combining the positional encoding ($x_{\text{position}}$) and token encoding ($x_{\text{token}}$) into a single input vector ($x_{\text{input}} = x_{\text{position}} + x_{\text{token}}$), DeBERTa keeps them separate as a tuple: $(x_{\text{position}}, x_{\text{token}})$. Then, at each self-attention layer, DeBERTa computes three distinct attention matrices, rather than the single attention matrix used in BERT:[note 1]
| Attention type | Query type | Key type | Example | 
|---|---|---|---|
| Content-to-content | Token | Token | "European"; "Union", "continent" | 
| Content-to-position | Token | Position | [adjective]; +1, +2, +3 | 
| Position-to-content | Position | Token | −1; "not", "very" | 
The three attention matrices are added together element-wise, then passed through a softmax layer and multiplied by a projection matrix.
Absolute position encoding is included in the final self-attention layer as additional input.
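The sum of the three attention terms can be written compactly. The following is a sketch in notation commonly used for DeBERTa, not taken from the text above: $Q^{c}$, $K^{c}$ and $V^{c}$ are assumed to be content (token) projections, $Q^{r}$ and $K^{r}$ relative-position projections, $\delta(i,j)$ the relative distance from position $i$ to position $j$, and $d$ the per-head dimension. The $\sqrt{3d}$ scaling and the identification of the final "projection matrix" with $V^{c}$ are likewise assumptions.

```latex
% Content-to-content + content-to-position + position-to-content
\tilde{A}_{i,j} = Q^{c}_{i} {K^{c}_{j}}^{\top}
                + Q^{c}_{i} {K^{r}_{\delta(i,j)}}^{\top}
                + K^{c}_{j} {Q^{r}_{\delta(j,i)}}^{\top}

% Softmax over the summed scores, then multiplication by the value matrix
H_{o} = \operatorname{softmax}\!\left( \frac{\tilde{A}}{\sqrt{3d}} \right) V^{c}
```

The three terms correspond, in order, to the rows of the table above.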