LLMs: Fine-tuning, distillation, and prompt engineering
Page Summary
- Foundation LLMs are pre-trained on vast amounts of text, enabling them to understand language structure and generate creative content, but they require fine-tuning for specific ML tasks like classification or regression.
- Fine-tuning adapts a foundation LLM to a particular task by training it on task-specific data, improving its performance for that task while retaining the original model size.
- Distillation produces a smaller, more efficient version of a fine-tuned LLM, sacrificing some performance for reduced computational and environmental costs.
- Prompt engineering lets users customize an LLM's output by providing examples or instructions within the prompt, leveraging the model's existing abilities without changing its parameters.
- Offline inference pre-computes and caches LLM predictions for tasks where a real-time response isn't critical, saving resources and enabling the use of larger models.
The previous unit described general-purpose LLMs, variously known as:
- foundation LLMs
- base LLMs
- pre-trained LLMs
A foundation LLM is trained on enough natural language to "know" a remarkable amount about grammar, words, and idioms. A foundation language model can generate helpful sentences about topics it is trained on. Furthermore, a foundation LLM can perform certain tasks traditionally called "creative," like writing poetry. However, a foundation LLM's generative text output isn't a solution for other kinds of common ML problems, such as regression or classification. For these use cases, a foundation LLM can serve as a platform rather than a solution.
Transforming a foundation LLM into a solution that meets an application's needs requires a process called fine-tuning. A secondary process called distillation generates a smaller (fewer parameters) version of the fine-tuned model.
Fine-tuning
Research shows that the pattern-recognition abilities of foundation language models are so powerful that they sometimes require relatively little additional training to learn specific tasks. That additional training, called fine-tuning, helps the model make better predictions on a specific task and unlocks an LLM's practical side.
Fine-tuning trains on examples specific to the task your application will perform. Engineers can sometimes fine-tune a foundation LLM on just a few hundred or a few thousand training examples.
Despite the relatively tiny number of training examples, standard fine-tuning is often computationally expensive. That's because standard fine-tuning involves updating every weight and bias on each backpropagation iteration. Fortunately, a smarter process called parameter-efficient tuning can fine-tune an LLM by adjusting only a subset of parameters on each backpropagation iteration.
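To make the idea concrete, here is a minimal PyTorch sketch of one simple form of parameter-efficient tuning: freeze the foundation model's parameters and train only a small, new task-specific subset. The models here are tiny hypothetical stand-ins, not real LLMs.

```python
import torch
from torch import nn

# Tiny hypothetical stand-in for a foundation model.
base_model = nn.Sequential(
    nn.Embedding(num_embeddings=1000, embedding_dim=64),
    nn.Linear(64, 64),
)

# Standard fine-tuning updates every weight and bias. Instead, freeze
# the base model so backpropagation never touches its parameters...
for param in base_model.parameters():
    param.requires_grad = False

# ...and train only a small, new task-specific subset of parameters.
task_head = nn.Linear(64, 2)  # e.g., a two-class classifier

optimizer = torch.optim.AdamW(task_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 1000, (8, 16))   # a toy batch of token IDs
labels = torch.randint(0, 2, (8,))         # toy task labels

features = base_model(tokens).mean(dim=1)  # frozen forward pass
loss = loss_fn(task_head(features), labels)
loss.backward()                            # gradients reach only task_head
optimizer.step()
```

Techniques such as adapters and LoRA apply the same principle inside the model itself, inserting small trainable modules between frozen layers.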
A fine-tuned model's predictions are usually better than the foundation LLM's predictions. However, a fine-tuned model contains the same number of parameters as the foundation LLM. So, if a foundation LLM contains ten billion parameters, then the fine-tuned version will also contain ten billion parameters.
Distillation
Most fine-tuned LLMs contain enormous numbers of parameters. Consequently, they require enormous computational and environmental resources to generate predictions. Note that large swaths of those parameters are typically irrelevant for a specific application.
Distillation creates a smaller version of an LLM. The distilled LLM generates predictions much faster and requires fewer computational and environmental resources than the full LLM. However, the distilled model's predictions are generally not quite as good as the original LLM's predictions. Recall that LLMs with more parameters almost always generate better predictions than LLMs with fewer parameters.
The most common form of distillation uses bulk inference to label data. This labeled data is then used to train a new, smaller model (known as the student model) that can be more affordably served. The labeled data serves as a channel by which the larger model (known as the teacher model) funnels its knowledge to the smaller model.
For example, suppose you need an online toxicity scorer for automatic moderation of comments. In this case, you can use a large offline toxicity scorer to label training data. Then, you can use that training data to distill a toxicity scorer model small enough to be served and handle live traffic.
A teacher model can sometimes provide more labeled data than it was trained on. Alternatively, a teacher model can funnel a numerical score instead of a binary label to the student model. A numerical score provides a richer training signal than a binary label, enabling the student model to predict not only positive and negative classes but also borderline classes.
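The toxicity-scorer example might look something like the following PyTorch sketch. The teacher and student here are tiny hypothetical stand-in networks, and the teacher's sigmoid output plays the role of the numerical toxicity score.

```python
import torch
from torch import nn

# Hypothetical stand-ins: a large, already-trained teacher and a much
# smaller student. (In practice the teacher is the fine-tuned LLM.)
teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 1))
student = nn.Sequential(nn.Linear(32, 8), nn.ReLU(), nn.Linear(8, 1))

unlabeled_comments = torch.randn(64, 32)  # toy comment feature vectors

# Step 1: bulk inference. The teacher scores the unlabeled data offline.
with torch.no_grad():
    # A numerical score in [0, 1] is a richer training signal than a
    # hard 0/1 toxicity label.
    soft_labels = torch.sigmoid(teacher(unlabeled_comments))

# Step 2: train the student to reproduce the teacher's scores.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # accepts soft targets in [0, 1]

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(student(unlabeled_comments), soft_labels)
    loss.backward()
    optimizer.step()
```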
Prompt engineering
Prompt engineering enables an LLM's end users to customize the model's output. That is, end users clarify how the LLM should respond to their prompt.
Humans learn well from examples. So do LLMs. Showing one example to an LLM is called one-shot prompting. For example, suppose you want a model to use the following format to output a fruit's family:
User inputs the name of a fruit: LLM outputs that fruit's family.
A one-shot prompt shows the LLM a single example of the preceding format and then asks the LLM to complete a query based on that example. For instance:
peach: drupe
apple: ______
A single example is sometimes sufficient. If it is, the LLM outputs a useful prediction. For instance:
apple: pome
In other situations, a single example is insufficient. That is, the user must show the LLM multiple examples. For instance, the following prompt contains two examples:
plum: drupe
pear: pome
lemon: ____
Providing multiple examples is called few-shot prompting. You can think of the first two lines of the preceding prompt as training examples.
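A few-shot prompt is just an ordinary string, so assembling one programmatically is straightforward. Here is a minimal Python sketch; the helper name and example pairs are hypothetical, and you would pass the resulting string to whichever LLM you use:

```python
def build_few_shot_prompt(examples, query):
    """Formats (input, output) example pairs plus a query into a prompt."""
    lines = [f"{fruit}: {family}" for fruit, family in examples]
    lines.append(f"{query}:")
    return "\n".join(lines)

# The examples double as in-context "training examples."
prompt = build_few_shot_prompt([("plum", "drupe"), ("pear", "pome")], "lemon")
print(prompt)
# plum: drupe
# pear: pome
# lemon:
```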
Can an LLM provide useful predictions with no examples (zero-shot prompting)? Sometimes, but LLMs like context. Without context, the following zero-shot prompt might return information about the technology company rather than the fruit:
apple: _______
Offline inference
The number of parameters in an LLM is sometimes so large that online inference is too slow to be practical for real-world tasks like regression or classification. Consequently, many engineering teams rely on offline inference (also known as bulk inference or static inference) instead. In other words, rather than responding to queries at serving time, the trained model makes predictions in advance and then caches those predictions.
It doesn't matter if it takes a long time for an LLM to complete its task if the LLM only has to perform the task once a week or once a month.
For example, Google Search used an LLM to perform offline inference in order to cache a list of over 800 synonyms for Covid vaccines in more than 50 languages. Google Search then used the cached list to identify queries about vaccines in live traffic.
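In miniature, the offline-inference pattern looks like the following Python sketch; big_model_predict is a hypothetical stand-in for a slow LLM call, and the query list is illustrative only.

```python
def big_model_predict(query: str) -> str:
    """Hypothetical stand-in for a slow call to a large LLM."""
    return f"prediction for {query!r}"

# Offline step, run occasionally (say, weekly): bulk-predict and cache.
anticipated_queries = ["covid vaccine", "flu shot", "mmr vaccine"]
prediction_cache = {q: big_model_predict(q) for q in anticipated_queries}

# Serving step: a fast cache lookup instead of a slow model call.
def serve(query: str) -> str:
    return prediction_cache.get(query, "cache miss: fall back to a default")

print(serve("flu shot"))  # served from the precomputed cache
```

Because the expensive model never runs at serving time, you can afford a much larger model than online latency budgets would otherwise allow.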
Use LLMs responsibly
Like any form of machine learning, LLMs generally share the biases of:
- The data they were trained on.
- The data they were distilled on.
Use LLMs fairly and responsibly, following the guidelines presented in the data modules and the Fairness module.