Movatterモバイル変換

Jump to content

BERT (language model)

From Wikipedia, the free encyclopedia

Series of language models developed by Google AI

Bidirectional encoder representations from transformers (BERT)
Original author	Google AI
Initial release	October 31, 2018
Repository	github.com/google-research/bert
Type	Large language model Transformer Foundation model
License	Apache 2.0
Website	arxiv.org/abs/1810.04805

Bidirectional encoder representations from transformers (BERT) is alanguage model introduced in October 2018 by researchers atGoogle.^[1]^[2] It learns to represent text as a sequence of vectors usingself-supervised learning. It uses theencoder-only transformer architecture. BERT dramatically improved the state of the art forlarge language models. As of 2020^[update], BERT is a ubiquitous baseline innatural language processing (NLP) experiments.^[3]

BERT is trained by masked token prediction and next sentence prediction. With this training, BERT learns contextual,latent representations of tokens in their context, similar toELMo andGPT-2.^[4] It found applications for many natural language processing tasks, such ascoreference resolution andpolysemy resolution.^[5] It improved onELMo and spawned the study of "BERTology", which attempts to interpret what is learned by BERT.^[3]

BERT was originally implemented in the English language at two model sizes, BERT_BASE (110 million parameters) and BERT_LARGE (340 million parameters). Both were trained on the TorontoBookCorpus^[6] (800M words) andEnglish Wikipedia (2,500M words).^[1]^: 5 The weights were released onGitHub.^[7] On March 11, 2020, 24 smaller models were released, the smallest being BERT_TINY with just 4 million parameters.^[7]

Architecture

High-level schematic diagram of BERT. It takes in a text, tokenizes it into a sequence of tokens, add in optional special tokens, and apply a Transformer encoder. The hidden states of the last layer can then be used as contextual word embeddings.

BERT is an "encoder-only"transformer architecture. At a high level, BERT consists of 4 modules:

Tokenizer: This module converts a piece of English text into a sequence of integers ("tokens").
Embedding: This module converts the sequence of tokens into an array of real-valued vectors representing the tokens. It represents the conversion of discrete token types into a lower-dimensionalEuclidean space.
Encoder: a stack of Transformer blocks withself-attention, but without causal masking.
Task head: This module converts the final representation vectors into one-hot encoded tokens again by producing a predicted probability distribution over the token types. It can be viewed as a simple decoder, decoding the latent representation into token types, or as an "un-embedding layer".

The task head is necessary for pre-training, but it is often unnecessary for so-called "downstream tasks," such asquestion answering orsentiment classification. Instead, one removes the task head and replaces it with a newly initialized module suited for the task, and finetune the new module. The latent vector representation of the model is directly fed into this new module, allowing for sample-efficienttransfer learning.^[1]^[8]

Encoder-only attention is all-to-all.

Embedding

This section describes the embedding used by BERT_BASE. The other one, BERT_LARGE, is similar, just larger.

The tokenizer of BERT is WordPiece, which is a sub-word strategy likebyte-pair encoding. Its vocabulary size is 30,000, and any token not appearing in its vocabulary is replaced by[UNK] ("unknown").

The three kinds of embedding used by BERT: token type, position, and segment type.

The first layer is the embedding layer, which contains three components: token type embeddings, position embeddings, and segment type embeddings.

Token type: The token type is a standard embedding layer, translating a one-hot vector into a dense vector based on its token type.
Position: The position embeddings are based on a token's position in the sequence. BERT uses absolute position embeddings, where each position in a sequence is mapped to a real-valued vector. Each dimension of the vector consists of asinusoidal function that takes the position in the sequence as input.
Segment type: Using a vocabulary of just 0 or 1, this embedding layer produces a dense vector based on whether the token belongs to the first or second text segment in that input. In other words, type-1 tokens are all tokens that appear after the[SEP] special token. All prior tokens are type-0.

The three embedding vectors are added together representing the initial token representation as a function of these three pieces of information. After embedding, the vector representation is normalized using aLayerNorm operation, outputting a 768-dimensional vector for each input token. After this, the representation vectors are passed forward through 12 Transformer encoder blocks, and are decoded back to 30,000-dimensional vocabulary space using a basic affine transformation layer.

Architectural family

The encoder stack of BERT has 2 free parameters: $L {\displaystyle L}$ , the number of layers, and $H {\displaystyle H}$ , thehidden size. There are always $H/64$ self-attention heads, and the feed-forward/filter size is always $4H$ . By varying these two numbers, one obtains an entire family of BERT models.^[9]

For BERT:

thefeed-forward size andfilter size are synonymous. Both of them denote the number of dimensions in the middle layer of the feed-forward network.
thehidden size andembedding size are synonymous. Both of them denote the number of real numbers used to represent a token.

The notation for encoder stack is written as L/H. For example, BERT_BASE is written as 12L/768H, BERT_LARGE as 24L/1024H, and BERT_TINY as 2L/128H.

Training

Pre-training

BERT was pre-trained simultaneously on two tasks:^[10]

Masked language modeling (MLM): In this task, BERT ingests a sequence of words, where one word may be randomly changed ("masked"), and BERT tries to predict the original words that had been changed. For example, in the sentence "The cat sat on the[MASK]," BERT would need to predict "mat." This helps BERT learn bidirectional context, meaning it understands the relationships between words not just from left to right or right to left but from both directions at the same time.

Next sentence prediction (NSP): In this task, BERT is trained to predict whether one sentence logically follows another. For example, given two sentences, "The cat sat on the mat" and "It was a sunny day", BERT has to decide if the second sentence is a valid continuation of the first one. This helps BERT understand relationships between sentences, which is important for tasks like question answering or document classification.

Masked language modeling

The masked language modeling task

In masked language modeling, 15% of tokens would be randomly selected for masked-prediction task, and the training objective was to predict the masked token given its context. In more detail, the selected token is:

replaced with a[MASK] token with probability 80%,
replaced with a random word token with probability 10%,
not replaced with probability 10%.

The reason not all selected tokens are masked is to avoid the dataset shift problem. The dataset shift problem arises when the distribution of inputs seen during training differs significantly from the distribution encountered during inference. A trained BERT model might be applied to word representation (likeWord2Vec), where it would be run over sentences not containing any[MASK] tokens. It is later found that more diverse training objectives are generally better.^[11]

As an illustrative example, consider the sentence "my dog is cute". It would first be divided into tokens like "my₁ dog₂ is₃ cute₄". Then a random token in the sentence would be picked. Let it be the 4th one "cute₄". Next, there would be three possibilities:

with probability 80%, the chosen token is masked, resulting in "my₁ dog₂ is₃[MASK]₄";
with probability 10%, the chosen token is replaced by a uniformly sampled random token, such as "happy", resulting in "my₁ dog₂ is₃ happy₄";
with probability 10%, nothing is done, resulting in "my₁ dog₂ is₃ cute₄".

After processing the input text, the model's 4th output vector is passed to its decoder layer, which outputs a probability distribution over its 30,000-dimensional vocabulary space.

Next sentence prediction

The next sentence prediction task

Given two sentences, the model predicts if they appear sequentially in the training corpus, outputting either[IsNext] or[NotNext]. During training, the algorithm sometimes samples two sentences from a single continuous span in the training corpus, while at other times, it samples two sentences from two discontinuous spans.

The first sentence starts with a special token,[CLS] (for "classify"). The two sentences are separated by another special token,[SEP] (for "separate"). After processing the two sentences, the final vector for the[CLS] token is passed to a linear layer for binary classification into[IsNext] and[NotNext].

For example:

Given "[CLS] my dog is cute[SEP] he likes playing[SEP]", the model should predict[IsNext].
Given "[CLS] my dog is cute[SEP] how do magnets work[SEP]", the model should predict[NotNext].

Fine-tuning

Fine-tuned tasks for BERT^[12]

Sentiment classification
Sentence classification
Answering multiple-choice questions
Part-of-speech tagging

BERT is meant as a general pretrained model for various applications in natural language processing. That is, after pre-training, BERT can befine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks such asnatural language inference andtext classification, and sequence-to-sequence-based language generation tasks such asquestion answering and conversational response generation.^[12]

The original BERT paper published results demonstrating that a small amount of finetuning (for BERT_LARGE, 1 hour on 1 Cloud TPU) allowed it to achievedstate-of-the-art performance on a number ofnatural language understanding tasks:^[1]

GLUE (General Language Understanding Evaluation) task set (consisting of 9 tasks);
SQuAD (Stanford Question Answering Dataset^[13]) v1.1 and v2.0;
SWAG (Situations With Adversarial Generations^[14]).

In the original paper, all parameters of BERT are fine-tuned, and recommended that, for downstream applications that are text classifications, the output token at the[CLS] input token is fed into a linear-softmax layer to produce the label outputs.^[1]

The original code base defined the final linear layer as a "pooler layer", in analogy withglobal pooling in computer vision, even though it simply discards all output tokens except the one corresponding to[CLS] .^[15]

Cost

BERT was trained on theBookCorpus (800M words) and a filtered version of English Wikipedia (2,500M words) without lists, tables, and headers.

Training BERT_BASE on 4 cloudTPU (16 TPU chips total) took 4 days, at an estimated cost of 500 USD.^[7] Training BERT_LARGE on 16 cloud TPU (64 TPU chips total) took 4 days.^[1]

Interpretation

Language models like ELMo, GPT-2, and BERT, spawned the study of "BERTology", which attempts to interpret what is learned by these models. Their performance on thesenatural language understanding tasks are not yet well understood.^[3]^[16]^[17] Several research publications in 2018 and 2019 focused on investigating the relationship behind BERT's output as a result of carefully chosen input sequences,^[18]^[19] analysis of internalvector representations through probing classifiers,^[20]^[21] and the relationships represented byattention weights.^[16]^[17]

The high performance of the BERT model could also be attributed to the fact that it is bidirectionally trained.^[22] This means that BERT, based on the Transformer model architecture, applies its self-attention mechanism to learn information from a text from the left and right side during training, and consequently gains a deep understanding of the context. For example, the wordfine can have two different meanings depending on the context (I feel finetoday,She has fineblond hair). BERT considers the words surrounding the target wordfine from the left and right side.

However it comes at a cost: due toencoder-only architecture lacking a decoder, BERT can'tbe prompted and can'tgenerate text, while bidirectional models in general do not work effectively without the right side, thus being difficult to prompt. As an illustrative example, if one wishes to use BERT to continue a sentence fragment "Today, I went to", then naively one would mask out all the tokens as "Today, I went to[MASK][MASK][MASK] ...[MASK] ." where the number of[MASK] is the length of the sentence one wishes to extend to. However, this constitutes a dataset shift, as during training, BERT has never seen sentences with that many tokens masked out. Consequently, its performance degrades. More sophisticated techniques allow text generation, but at a high computational cost.^[23]

History

BERT was originally published by Google researchers Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. The design has its origins from pre-training contextual representations, includingsemi-supervised sequence learning,^[24] generative pre-training,ELMo,^[25] and ULMFit.^[26] Unlike previous models, BERT is a deeply bidirectional,unsupervised language representation, pre-trained using only a plaintext corpus. Context-free models such asword2vec orGloVe generate a single word embedding representation for each word in the vocabulary, whereas BERT takes into account the context for each occurrence of a given word. For instance, whereas the vector for "running" will have the same word2vec vector representation for both of its occurrences in the sentences "He is running a company" and "He is running a marathon", BERT will provide a contextualized embedding that will be different according to the sentence.^[4]

On October 25, 2019,Google announced that they had started applying BERT models toEnglish-language search queries onGoogle Search within the US.^[27] On December 9, 2019, it was reported that BERT had been adopted by Google Search for over 70 languages.^[28]^[29] In October 2020, almost every single English-based query was processed by a BERT model.^[30]

Variants

The BERT models were influential and inspired many variants.

RoBERTa (2019)^[31] was an engineering improvement. It preserves BERT's architecture (slightly larger, at 355M parameters), but improves its training, changing key hyperparameters, removing thenext-sentence prediction task, and using much largermini-batch sizes.

XLM-RoBERTa (2019)^[32] was a multilingual RoBERTa model. It was one of the first works on multilingual language modeling at scale.

DistilBERT (2019)distills BERT_BASE to a model with just 60% of its parameters (66M), while preserving 95% of its benchmark scores.^[33]^[34] Similarly,TinyBERT (2019)^[35] is a distilled model with just 28% of its parameters.

ALBERT (2019)^[36] used shared-parameter across layers, and experimented with independently varying the hidden size and the word-embedding layer's output size as two hyperparameters. They also replaced thenext sentence prediction task with thesentence-order prediction (SOP) task, where the model must distinguish the correct order of two consecutive text segments from their reversed order.

ELECTRA (2020)^[37] applied the idea ofgenerative adversarial networks to the MLM task. Instead of masking out tokens, a small language model generates random plausible substitutions, and a larger network identify these replaced tokens. The small model aims to fool the large model.

DeBERTa (2020)^[38] is a significant architectural variant, withdisentangled attention. Its key idea is to treat the positional and token encodings separately throughout the attention mechanism. Instead of combining the positional encoding ( $x_{\mathrm {position} }$ ) and token encoding ( $x_{\mathrm {token} }$ ) into a single input vector ( $x_{\mathrm {input} }=x_{\mathrm {position} }+x_{\mathrm {token} }$ ), DeBERTa keeps them separate as a tuple: $(x_{\mathrm {position} },x_{\mathrm {token} })$ . Then, at each self-attention layer, DeBERTa computes three distinct attention matrices, rather than the single attention matrix used in BERT:^{[note 1]}


Attention type	Query type	Key type	Example
Content-to-content	Token	Token	"European"; "Union", "continent"
Content-to-position	Token	Position	[adjective]; +1, +2, +3
Position-to-content	Position	Token	−1; "not", "very"

The three attention matrices are added together element-wise, then passed through a softmax layer and multiplied by a projection matrix.

Absolute position encoding is included in the final self-attention layer as additional input.

Notes

^The position-to-position type was omitted by the authors for being useless.

References

^^a ^b ^c ^d ^e ^fDevlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (October 11, 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".arXiv:1810.04805v2 [cs.CL].
^"Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing".Google AI Blog. November 2, 2018. RetrievedNovember 27, 2019.
^^a ^b ^cRogers, Anna; Kovaleva, Olga; Rumshisky, Anna (2020)."A Primer in BERTology: What We Know About How BERT Works".Transactions of the Association for Computational Linguistics.8:842–866.arXiv:2002.12327.doi:10.1162/tacl_a_00349.S2CID 211532403.
^^a ^bEthayarajh, Kawin (September 1, 2019),How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings,arXiv:1909.00512
^Anderson, Dawn (November 5, 2019)."A deep dive into BERT: How BERT launched a rocket into natural language understanding".Search Engine Land. RetrievedAugust 6, 2024.
^Zhu, Yukun; Kiros, Ryan; Zemel, Rich; Salakhutdinov, Ruslan; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja (2015). "Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books". pp. 19–27.arXiv:1506.06724 [cs.CV].
^^a ^b ^c"BERT".GitHub. RetrievedMarch 28, 2023.
^Zhang, Tianyi; Wu, Felix; Katiyar, Arzoo; Weinberger, Kilian Q.; Artzi, Yoav (March 11, 2021),Revisiting Few-sample BERT Fine-tuning,arXiv:2006.05987
^Turc, Iulia; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (September 25, 2019),Well-Read Students Learn Better: On the Importance of Pre-training Compact Models,arXiv:1908.08962
^"Summary of the models — transformers 3.4.0 documentation".huggingface.co. RetrievedFebruary 16, 2023.
^Tay, Yi; Dehghani, Mostafa; Tran, Vinh Q.; Garcia, Xavier; Wei, Jason; Wang, Xuezhi; Chung, Hyung Won; Shakeri, Siamak; Bahri, Dara (February 28, 2023),UL2: Unifying Language Learning Paradigms,arXiv:2205.05131
^^a ^bZhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024)."11.9. Large-Scale Pretraining with Transformers".Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press.ISBN 978-1-009-38943-3.
^Rajpurkar, Pranav; Zhang, Jian; Lopyrev, Konstantin; Liang, Percy (October 10, 2016). "SQuAD: 100,000+ Questions for Machine Comprehension of Text".arXiv:1606.05250 [cs.CL].
^Zellers, Rowan; Bisk, Yonatan; Schwartz, Roy; Choi, Yejin (August 15, 2018). "SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference".arXiv:1808.05326 [cs.CL].
^"bert/modeling.py at master · google-research/bert".GitHub. RetrievedSeptember 16, 2024.
^^a ^bKovaleva, Olga; Romanov, Alexey; Rogers, Anna; Rumshisky, Anna (November 2019)."Revealing the Dark Secrets of BERT".Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 4364–4373.doi:10.18653/v1/D19-1445.S2CID 201645145.
^^a ^bClark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (2019)."What Does BERT Look at? An Analysis of BERT's Attention".Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Stroudsburg, PA, USA: Association for Computational Linguistics:276–286.arXiv:1906.04341.doi:10.18653/v1/w19-4828.
^Khandelwal, Urvashi; He, He; Qi, Peng; Jurafsky, Dan (2018). "Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context".Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics:284–294.arXiv:1805.04623.doi:10.18653/v1/p18-1027.S2CID 21700944.
^Gulordava, Kristina; Bojanowski, Piotr; Grave, Edouard; Linzen, Tal; Baroni, Marco (2018). "Colorless Green Recurrent Networks Dream Hierarchically".Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 1195–1205.arXiv:1803.11138.doi:10.18653/v1/n18-1108.S2CID 4460159.
^Giulianelli, Mario; Harding, Jack; Mohnert, Florian; Hupkes, Dieuwke; Zuidema, Willem (2018). "Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information".Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Stroudsburg, PA, USA: Association for Computational Linguistics:240–248.arXiv:1808.08079.doi:10.18653/v1/w18-5426.S2CID 52090220.
^Zhang, Kelly; Bowman, Samuel (2018)."Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Syntactic Task Analysis".Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Stroudsburg, PA, USA: Association for Computational Linguistics:359–361.doi:10.18653/v1/w18-5448.
^Sur, Chiranjib (January 2020)."RBN: enhancement in language attribute prediction using global representation of natural language transfer learning technology like Google BERT".SN Applied Sciences.2 (1) 22.doi:10.1007/s42452-019-1765-9.
^Patel, Ajay; Li, Bryan; Mohammad Sadegh Rasooli; Constant, Noah; Raffel, Colin; Callison-Burch, Chris (2022). "Bidirectional Language Models Are Also Few-shot Learners".arXiv:2209.14500 [cs.LG].
^Dai, Andrew; Le, Quoc (November 4, 2015). "Semi-supervised Sequence Learning".arXiv:1511.01432 [cs.LG].
^Peters, Matthew; Neumann, Mark; Iyyer, Mohit; Gardner, Matt; Clark, Christopher; Lee, Kenton; Luke, Zettlemoyer (February 15, 2018). "Deep contextualized word representations".arXiv:1802.05365v2 [cs.CL].
^Howard, Jeremy; Ruder, Sebastian (January 18, 2018). "Universal Language Model Fine-tuning for Text Classification".arXiv:1801.06146v5 [cs.CL].
^Nayak, Pandu (October 25, 2019)."Understanding searches better than ever before".Google Blog. RetrievedDecember 10, 2019.
^"Understanding searches better than ever before".Google. October 25, 2019. RetrievedAugust 6, 2024.
^Montti, Roger (December 10, 2019)."Google's BERT Rolls Out Worldwide".Search Engine Journal. RetrievedDecember 10, 2019.
^"Google: BERT now used on almost every English query".Search Engine Land. October 15, 2020. RetrievedNovember 24, 2020.
^Liu, Yinhan; Ott, Myle; Goyal, Naman; Du, Jingfei; Joshi, Mandar; Chen, Danqi; Levy, Omer; Lewis, Mike; Zettlemoyer, Luke; Stoyanov, Veselin (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach".arXiv:1907.11692 [cs.CL].
^Conneau, Alexis; Khandelwal, Kartikay; Goyal, Naman; Chaudhary, Vishrav; Wenzek, Guillaume; Guzmán, Francisco; Grave, Edouard; Ott, Myle; Zettlemoyer, Luke; Stoyanov, Veselin (2019). "Unsupervised Cross-lingual Representation Learning at Scale".arXiv:1911.02116 [cs.CL].
^Sanh, Victor; Debut, Lysandre; Chaumond, Julien; Wolf, Thomas (February 29, 2020),DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,arXiv:1910.01108
^"DistilBERT".huggingface.co. RetrievedAugust 5, 2024.
^Jiao, Xiaoqi; Yin, Yichun; Shang, Lifeng; Jiang, Xin; Chen, Xiao; Li, Linlin; Wang, Fang; Liu, Qun (October 15, 2020),TinyBERT: Distilling BERT for Natural Language Understanding,arXiv:1909.10351
^Lan, Zhenzhong; Chen, Mingda; Goodman, Sebastian; Gimpel, Kevin; Sharma, Piyush; Soricut, Radu (February 8, 2020),ALBERT: A Lite BERT for Self-supervised Learning of Language Representations,arXiv:1909.11942
^Clark, Kevin; Luong, Minh-Thang; Le, Quoc V.; Manning, Christopher D. (March 23, 2020),ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators,arXiv:2003.10555
^He, Pengcheng; Liu, Xiaodong; Gao, Jianfeng; Chen, Weizhu (October 6, 2021),DeBERTa: Decoding-enhanced BERT with Disentangled Attention,arXiv:2006.03654

Further reading

Rogers, Anna; Kovaleva, Olga; Rumshisky, Anna (2020). "A Primer in BERTology: What we know about how BERT works".arXiv:2002.12327 [cs.CL].

External links

Official GitHub repository

v
t
e

Computer
programs

AlphaGo

Versions	AlphaGo (2015) Master (2016) AlphaGo Zero (2017) AlphaZero (2017) MuZero (2019)
Competitions	Fan Hui (2015) Lee Sedol (2016) Ke Jie (2017)
In popular culture	AlphaGo (2017) The MANIAC (2023)

Other

AlphaFold (2018)
AlphaStar (2019)
AlphaDev (2023)
AlphaGeometry (2024)
AlphaGenome (2025)

Machine
learning

Neural networks	Inception (2014) WaveNet (2016) MobileNet (2017) Transformer (2017) EfficientNet (2019) Gato (2022)
Other	Quantum Artificial Intelligence Lab TensorFlow Tensor Processing Unit

Generative
AI

Chatbots	Assistant (2016) Sparrow (2022) Gemini (2023) Nano Banana
Models	BERT (2018) XLNet (2019) T5 (2019) LaMDA (2021) Chinchilla (2022) PaLM (2022) Imagen (2023) Gemini (2023) VideoPoet (2024) Gemma (2024) Veo (2024)
Other	DreamBooth (2022) NotebookLM (2023) Vids (2024) Gemini Robotics (2025)

See also

v
t
e

a subsidiary ofAlphabet

Company

Divisions

Subsidiaries

Active

Defunct

Programs

Current	Krishna Bharat Vint Cerf Jeff Dean John Doerr Sanjay Ghemawat Al Gore John L. Hennessy Urs Hölzle Salar Kamangar Ray Kurzweil Ann Mather Alan Mulally Rick Osterloh Sundar Pichai (CEO) Ruth Porat (CFO) Rajen Sheth Hal Varian Neal Mohan
Former	Andy Bechtolsheim Sergey Brin (co-founder) David Cheriton Matt Cutts David Drummond Alan Eustace Timnit Gebru Omid Kordestani Paul Otellini Larry Page (co-founder) Patrick Pichette Eric Schmidt Ram Shriram Amit Singhal Shirley M. Tilghman Rachel Whetstone Susan Wojcicki

General	Censorship DeGoogle FairSearch "Google's Ideological Echo Chamber" No Tech for Apartheid Privacy concerns Street View YouTube Trade unions Alphabet Workers Union YouTube copyright issues
Incidents	Backdoor advertisement controversy Blocking of YouTube videos in Germany Data breach Elsagate Fantastic Adventures scandal Kohistan video case Reactions toInnocence of Muslims San Francisco tech bus protests Services outages Slovenian government incident Walkouts YouTube headquarters shooting

Other

Software

A–C	Accelerated Linear Algebra AMP Actions on Google ALTS American Fuzzy Lop Android Cloud to Device Messaging Android Debug Bridge Android NDK Android Runtime Android SDK Android Studio Angular AngularJS Apache Beam APIs App Engine App Inventor App Maker App Runtime for Chrome AppJet Apps Script AppSheet ARCore Base Bazel BeyondCorp Bigtable BigQuery Bionic Blockly Borg Caja Cameyo Chart API Charts Chrome Frame Chromium Blink Closure Tools Cloud Connect Cloud Dataflow Cloud Datastore Cloud Messaging Cloud Shell Cloud Storage Code Search Compute Engine Cpplint
D–N	Dalvik Data Protocol Dialogflow Exposure Notification Fast Pair Fastboot Federated Learning of Cohorts File System Firebase Firebase Studio Firebase Cloud Messaging FlatBuffers Flutter Freebase Gadgets Ganeti Gears Gerrit GLOP gRPC Gson Guava Guetzli Guice gVisor GYP JAX Jetpack Compose Keyhole Markup Language Kubernetes Kythe LevelDB Lighthouse Looker Studio lmctfy MapReduce Mashup Editor Matter Mobile Services Namebench Native Client Neatx Neural Machine Translation Nomulus
O–Z	Open Location Code OpenRefine OpenSocial Optimize OR-Tools Pack PageSpeed Piper Plugin for Eclipse Polymer Programmable Search Engine Project Shield Public DNS reCAPTCHA RenderScript SafetyNet SageTV Schema.org Search Console Shell Sitemaps Skia Graphics Engine Spanner Sputnik Stackdriver Swiffy Tango TensorFlow Tesseract Test Translator Toolkit Urchin UTM parameters V8 VirusTotal VisBug Wave Federation Protocol Weave Web Accelerator Web Designer Web Server Web Toolkit Webdriver Torso WebRTC

Operating systems

Machine learning models

Neural networks

Computer programs

Formats and codecs

Programming languages

Search algorithms

Typefaces

A	Aardvark Account Dashboard Takeout Ad Manager AdMob Ads AdSense Affiliate Network Alerts Allo Analytics Android Auto Android Beam Answers Apture Arts & Culture Assistant Attribution Authenticator
B	BebaPay BeatThatQuote.com Beam Blog Search Blogger Body Bookmarks Books Ngram Viewer Browser Sync Building Maker Bump BumpTop Buzz
C	Calendar Cast Catalogs Chat Checkout Chrome Chrome Apps Chrome Experiments Chrome Remote Desktop Chrome Web Store Classroom Cloud Print Cloud Search Contacts Contributor Crowdsource Currents (social app) Currents (news app)
D	Data Commons Dataset Search Desktop Dictionary Dinosaur Game Directory Docs Docs Editors Domains Drawings Drive Duo
E	Earth Etherpad Expeditions Express
F	Family Link Fast Flip FeedBurner fflick Fi Wireless Finance Files Find Hub Fit Flights Flu Trends Fonts Forms Friend Connect Fusion Tables
G	Gboard Gemini Nano Banana Gesture Search Gizmo5 Google+ Gmail Goggles GOOG-411 Grasshopper Groups
H	Hangouts Helpouts Home
I	iGoogle Images Image Labeler Image Swirl Inbox by Gmail Input Tools Japanese Input Pinyin Insights for Search
J	Jaiku Jamboard
K	Kaggle Keep Knol
L	Labs Latitude Lens Like.com Live Transcribe Lively
M	Map Maker Maps Maps Navigation Marketing Platform Meet Messages Moderator My Tracks
N	Nearby Share News News & Weather News Archive Notebook NotebookLM Now
O	Offers One One Pass Opinion Rewards Orkut Oyster
P	Panoramio PaperofRecord.com Patents Page Creator Pay (mobile app) Pay (payment method) Pay Send People Cards Person Finder Personalized Search Photomath Photos Picasa Picasa Web Albums Picnik Pixel Camera Play Play Books Play Games Play Music Play Newsstand Play Pass Play Services Podcasts Poly Postini PostRank Primer Public Alerts Public Data Explorer
Q	Question Hub Quick, Draw! Quick Search Box Quick Share Quickoffice
R	Read Along Reader Reply
S	Safe Browsing SageTV Santa Tracker Schemer Scholar Search AI Overviews Knowledge Graph SafeSearch Searchwiki Sheets Shoploop Shopping Sidewiki Sites Slides Snapseed Socratic Softcard Songza Sound Amplifier Spaces Sparrow (chatbot) Sparrow (email client) Speech Recognition & Synthesis Squared Stadia Station Store Street View Surveys Sync
T	Tables Talk TalkBack Tasks Tenor Tez Tilt Brush Toolbar Toontastic 3D Translate Travel Trendalyzer Trends TV
U	URL Shortener
V	Video Vids Voice Voice Access Voice Search
W	Wallet Wave Waze WDYL Web Light Where Is My Train Widevine Wiz Word Lens Workspace Workspace Marketplace
Y	YouTube YouTube Kids YouTube Music YouTube Premium YouTube Shorts YouTube Studio YouTube TV YouTube VR

Hardware

Smartphones	Pixel (2016) Pixel 2 (2017) Pixel 3 (2018) Pixel 3a (2019) Pixel 4 (2019) Pixel 4a (2020) Pixel 5 (2020) Pixel 5a (2021) Pixel 6 (2021) Pixel 6a (2022) Pixel 7 (2022) Pixel 7a (2023) Pixel Fold (2023) Pixel 8 (2023) Pixel 8a (2024) Pixel 9 (2024) Pixel 9 Pro Fold (2024) Pixel 9a (2025) Pixel 10 (2025) Pixel 10 Pro Fold (2025)
Smartwatches	Pixel Watch (2022) Pixel Watch 2 (2023) Pixel Watch 3 (2024) Pixel Watch 4 (2025)
Tablets	Pixel C (2015) Pixel Slate (2018) Pixel Tablet (2023)
Laptops	Chromebook Pixel (2013–2015) Pixelbook (2017) Pixelbook Go (2019)
Other	Pixel Buds (2017–present)

Smartphones	Nexus One (2010) Nexus S (2010) Galaxy Nexus (2011) Nexus 4 (2012) Nexus 5 (2013) Nexus 6 (2014) Nexus 5X (2015) Nexus 6P (2015)
Tablets	Nexus 7 (2012) Nexus 10 (2012) Nexus 7 (2013) Nexus 9 (2014)
Other	Nexus Q (2012) Nexus Player (2014)

Other

v t e Litigation
Advertising	Feldman v. Google, Inc. (2007) Rescuecom Corp. v. Google Inc. (2009) Goddard v. Google, Inc. (2009) Rosetta Stone Ltd. v. Google, Inc. (2012) Google, Inc. v. American Blind & Wallpaper Factory, Inc. (2017) Jedi Blue
Antitrust	European Union (2010–present) United States v. Adobe Systems, Inc., Apple Inc., Google Inc., Intel Corporation, Intuit, Inc., and Pixar (2011) Umar Javeed, Sukarma Thapar, Aaqib Javeed vs. Google LLC and Ors. (2019) United States v. Google LLC (2020) United States v. Google LLC (2023)
Intellectual property	Perfect 10, Inc. v. Amazon.com, Inc. (2007) Viacom International, Inc. v. YouTube, Inc. (2010) Lenz v. Universal Music Corp.(2015) Authors Guild, Inc. v. Google, Inc. (2015) Field v. Google, Inc. (2016) Google LLC v. Oracle America, Inc. (2021) Smartphone patent wars
Privacy	Rocky Mountain Bank v. Google, Inc. (2009) Hibnick v. Google, Inc. (2010) United States v. Google Inc. (2012) Judgement of the German Federal Court of Justice on Google's autocomplete function (2013) Joffe v. Google, Inc. (2013) Mosley v SARL Google (2013) Google Spain v AEPD and Mario Costeja González (2014) Frank v. Gaos (2019)
Other	Garcia v. Google, Inc. (2015) Google LLC v Defteros (2020) Epic Games v. Google (2021) Gonzalez v. Google LLC (2022)

Related

Concepts

Products

Android	Booting process Custom distributions Features Recovery mode Software development
Street View coverage	Africa Antarctica Asia Israel Europe North America Canada United States Oceania South America Argentina Chile Colombia
YouTube	Copyright strike Education Features Moderation Most-disliked videos Most-liked videos Most-subscribed channels Most-viewed channels Most-viewed videos Arabic music videos Chinese music videos French music videos Indian videos Pakistani videos Official channel Social impact YouTube Premium original programming
Other	Gmail interface Maps pin Most downloaded Google Play applications Stadia games

Popular culture

Google Feud
Google Me (film)
"Google Me" (Kim Zolciak song)
"Google Me" (Teyana Taylor song)
Is Google Making Us Stupid?
Proceratium google
Matt Nathanson: Live at Google
The Billion Dollar Code
The Internship
Where on Google Earth is Carmen Sandiego?

Other

Italics denotediscontinued products.

v
t
e

Natural language processing

General terms

Text segmentation	Compound-term processing Lemmatisation Lexical analysis Text chunking Stemming Sentence segmentation Word segmentation

Automatic summarization

Machine translation

Distributional semantics models

Language resources,
datasets and corpora

Types and standards	Corpus linguistics Lexical resource Linguistic Linked Open Data Machine-readable dictionary Parallel text PropBank Semantic network Simple Knowledge Organization System Speech corpus Text corpus Thesaurus (information retrieval) Treebank Universal Dependencies
Data	BabelNet Bank of English DBpedia FrameNet Google Ngram Viewer UBY WordNet Wikidata

Automatic identification
and data capture

Computer-assisted
reviewing

Natural language
user interface

Related

v
t
e

Artificial intelligence (AI)

Concepts

Applications

Implementations

Audio–visual	AlexNet WaveNet Human image synthesis HWR OCR Computer vision Speech synthesis 15.ai ElevenLabs Speech recognition Whisper Facial recognition AlphaFold Text-to-image models Aurora DALL-E Firefly Flux Ideogram Imagen Midjourney Recraft Stable Diffusion Text-to-video models Dream Machine Runway Gen Hailuo AI Kling Sora Veo Music generation Riffusion Suno AI Udio
Text	Word2vec Seq2seq GloVe BERT T5 Llama Chinchilla AI PaLM GPT 1 2 3 J ChatGPT 4 4o o1 o3 4.5 4.1 o4-mini 5 Claude Gemini Gemini (language model) Gemma Grok LaMDA BLOOM DBRX Project Debater IBM Watson IBM Watsonx Granite PanGu-Σ DeepSeek Qwen
Decisional	AlphaGo AlphaZero OpenAI Five Self-driving car MuZero Action selection AutoGPT Robot control

People

Architectures

Category

Retrieved from "https://en.wikipedia.org/w/index.php?title=BERT_(language_model)&oldid=1311272467"

Hidden categories:

[8]ページ先頭

©2009-2025 Movatter.jp