Large language models
Page Summary
- This module explores language models, which estimate the probability of a token or sequence of tokens occurring within a longer sequence, enabling tasks like text generation, translation, and summarization.
- Language models use context, the information surrounding a target token, to improve prediction accuracy, with recurrent neural networks offering more context than traditional N-grams.
- N-grams are ordered sequences of words used to build language models, with longer N-grams providing more context but potentially encountering sparsity issues.
- Tokens, the atomic units of language modeling, represent words, subwords, or characters.
- While recurrent neural networks improve context handling compared to N-grams, they have limitations, paving the way for large language models, which evaluate the whole context at once.
Learning objectives:
- Define a few different types of language models and their components.
- Describe how large language models are created and the importance of context and parameters.
- Identify how large language models take advantage of self-attention.
- Identify three key problems with large language models.
- Explain how fine-tuning and distillation can improve a model's predictions and efficiency.
This module assumes you are familiar with the concepts covered in the following modules:
- Introduction to Machine Learning
- Linear regression
- Working with categorical data
- Datasets, generalization, and overfitting
- Neural networks
- Embeddings
What is a language model?
A language model estimates the probability of a token or sequence of tokens occurring within a longer sequence of tokens. A token could be a word, a subword (a subset of a word), or even a single character.
Most modern language models tokenize by subwords, that is, by chunks of text that carry semantic meaning. The chunks can vary in length from single characters, like punctuation or the possessive 's', to whole words. Prefixes and suffixes might be represented as separate subwords. For example, the word unwatched might be represented by the following three subwords:
- un (the prefix)
- watch (the root)
- ed (the suffix)
The word cats might be represented by the following two subwords:
- cat (the root)
- s (the suffix)
A more complex word like "antidisestablishmentarianism" might be represented as six subwords:
- anti
- dis
- establish
- ment
- arian
- ism
Tokenization is language specific, so the number of characters per token differs across languages. For English, one token corresponds to roughly 4 characters, or about 3/4 of a word, so 400 tokens ~= 300 English words.
Tokens are the atomic unit, or smallest unit, of language modeling.
Tokens are now also being successfully applied to computer vision and audio generation.
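To make the subword idea concrete, here is a minimal sketch of greedy longest-match tokenization against a tiny, hand-picked vocabulary. Real tokenizers (for example, those based on byte-pair encoding or WordPiece) learn their vocabularies from large corpora; the vocabulary and splitting strategy below are illustrative assumptions only.

```python
# Minimal sketch of greedy longest-match subword tokenization.
# The vocabulary is hypothetical; real tokenizers learn theirs from data.
VOCAB = {"un", "watch", "ed", "cat", "s",
         "anti", "dis", "establish", "ment", "arian", "ism"}

def tokenize(word: str) -> list[str]:
    """Splits a word by repeatedly taking the longest vocabulary entry
    that matches the start of the remaining text."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest match first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])          # unknown character fallback
            i += 1
    return tokens

print(tokenize("unwatched"))                     # ['un', 'watch', 'ed']
print(tokenize("cats"))                          # ['cat', 's']
print(tokenize("antidisestablishmentarianism"))  # ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
```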
Consider the following sentence and the token(s) that might complete it:
When I hear rain on my roof, I _______ in my kitchen.
A language model determines the probabilities of different tokens or sequences of tokens to complete that blank. For example, the following probability table identifies some possible tokens and their probabilities:
| Probability | Token(s) |
|---|---|
| 9.4% | cook soup |
| 5.2% | warm up a kettle |
| 3.6% | cower |
| 2.5% | nap |
| 2.2% | relax |
In some situations, the sequence of tokens could be an entire sentence, paragraph, or even an entire essay.
An application can use the probability table to make predictions. The prediction might be the token(s) with the highest probability (for example, "cook soup") or a random selection from tokens having a probability greater than a certain threshold.
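As a small sketch of those two strategies, the snippet below picks a completion from the table above either greedily (highest probability) or by weighted random sampling from candidates above a threshold. The threshold value here is an arbitrary choice for illustration.

```python
import random

# Completion probabilities from the table above.
candidates = {
    "cook soup": 0.094,
    "warm up a kettle": 0.052,
    "cower": 0.036,
    "nap": 0.025,
    "relax": 0.022,
}

# Strategy 1: always pick the highest-probability completion.
greedy_choice = max(candidates, key=candidates.get)
print(greedy_choice)  # cook soup

# Strategy 2: sample from tokens above a probability threshold,
# weighting each surviving candidate by its probability.
THRESHOLD = 0.03  # arbitrary cutoff for illustration
survivors = {t: p for t, p in candidates.items() if p > THRESHOLD}
sampled_choice = random.choices(list(survivors), weights=list(survivors.values()))[0]
print(sampled_choice)  # one of: cook soup, warm up a kettle, cower
```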
Estimating the probability of what fills in the blank in a text sequence can be extended to more complex tasks, including:
- Generating text.
- Translating text from one language to another.
- Summarizing documents.
By modeling the statistical patterns of tokens, modern language models develop extremely powerful internal representations of language and can generate plausible language.
N-gram language models
N-grams are ordered sequences of words used to build language models, where N is the number of words in the sequence. For example, when N is 2, the N-gram is called a 2-gram (or a bigram); when N is 5, the N-gram is called a 5-gram. Given the following phrase in a training document:
you are very nice
The resulting 2-grams are as follows:
- you are
- are very
- very nice
When N is 3, the N-gram is called a 3-gram (or a trigram). Given that same phrase, the resulting 3-grams are:
- you are very
- are very nice
Given two words as input, a language model based on 3-grams can predict the likelihood of the third word. For example, given the following two words:
orange is
A language model examines all the different 3-grams derived from its training corpus that start with orange is to determine the most likely third word. Hundreds of 3-grams could start with the two words orange is, but you can focus solely on the following two possibilities:
orange is ripe
orange is cheerful
The first possibility (orange is ripe) is about orange the fruit, while the second possibility (orange is cheerful) is about the color orange.
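A minimal sketch of this idea, assuming a toy training corpus, is to count every 3-gram and then, for a given pair of starting words, rank the observed third words by how often they followed that pair:

```python
from collections import Counter, defaultdict

# Toy training corpus; a real model would use a far larger one.
corpus = "the orange is ripe and the sky is orange and the orange is sweet".split()

# Count how often each third word follows each pair of words (3-grams).
following = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    following[(w1, w2)][w3] += 1

def predict_third_word(w1: str, w2: str) -> list[tuple[str, float]]:
    """Returns candidate third words with their estimated probabilities."""
    counts = following[(w1, w2)]
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common()]

print(predict_third_word("orange", "is"))
# [('ripe', 0.5), ('sweet', 0.5)]
```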
Context
Humans can retain relatively long contexts. While watching Act 3 of a play, youretain knowledge of characters introduced in Act 1. Similarly, thepunchline of a long joke makes you laugh because you can remember the contextfrom the joke's setup.
In language models, context is helpful information before or after the target token. Context can help a language model determine whether "orange" refers to a citrus fruit or a color.
Context can help a language model make better predictions, but does a 3-gram provide sufficient context? Unfortunately, the only context a 3-gram provides is the first two words. For example, the two words orange is don't provide enough context for the language model to predict the third word. Due to lack of context, language models based on 3-grams make a lot of mistakes.
Longer N-grams would certainly provide more context than shorter N-grams. However, as N grows, the relative occurrence of each instance decreases. When N becomes very large, the language model typically has only a single instance of each occurrence of N tokens, which isn't very helpful in predicting the target token.
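The sketch below illustrates that sparsity on a toy corpus by counting how many distinct N-grams occur only once as N grows; real corpora are vastly larger, but the same trend appears at larger N.

```python
from collections import Counter

# Toy corpus; real corpora are far larger but show the same trend.
corpus = ("you are very nice and you are very kind and "
          "you are quite nice and they are very nice").split()

for n in range(2, 6):
    ngrams = Counter(zip(*(corpus[i:] for i in range(n))))
    singletons = sum(1 for count in ngrams.values() if count == 1)
    print(f"{n}-grams: {len(ngrams)} distinct, {singletons} seen only once")
```

As N increases, a growing share of N-grams appears exactly once, so their counts say little about which token is likely to come next.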
Recurrent neural networks
Recurrent neural networks provide more context than N-grams. A recurrent neural network is a type of neural network that trains on a sequence of tokens. For example, a recurrent neural network can gradually learn (and learn to ignore) selected context from each word in a sentence, kind of like you would when listening to someone speak. A large recurrent neural network can gain context from a passage of several sentences.
Although recurrent neural networks learn more context than N-grams, the amount of useful context recurrent neural networks can intuit is still relatively limited. Recurrent neural networks evaluate information "token by token." In contrast, large language models, the topic of the next section, can evaluate the whole context at once.
Note that training recurrent neural networks for long contexts is constrained by the vanishing gradient problem.
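To show what "token by token" means in practice, here is a minimal, untrained recurrent cell written with NumPy: the hidden state is the network's running summary of everything seen so far, and it is updated once per token. The sizes, weights, and vocabulary are arbitrary placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sizes for illustration; a trained RNN would learn these weights.
embedding_size, hidden_size = 8, 16
vocab = ["when", "i", "hear", "rain"]
W_xh = rng.normal(size=(embedding_size, hidden_size))  # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))     # hidden -> hidden
embeddings = {token: rng.normal(size=embedding_size) for token in vocab}

# The hidden state is the network's running summary of the context so far.
hidden = np.zeros(hidden_size)

# Token-by-token processing: each step folds one more token into the state.
for token in ["when", "i", "hear", "rain"]:
    hidden = np.tanh(embeddings[token] @ W_xh + hidden @ W_hh)

print(hidden.shape)  # (16,) -- one fixed-size vector summarizing the sequence
```

Because each update passes through the same weights many times, gradients can shrink toward zero over long sequences, which is the vanishing gradient problem mentioned above.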