Design multimodal prompts

The Gemini API in Vertex AI lets you include multimodal inputs like text, images, and video as part of your prompt to Gemini models. This page provides best practices for designing multimodal prompts and explains how to troubleshoot prompts that aren't working as expected. For general guidance on prompt design, see our recommended prompt design strategies, or see the following list of multimodal-specific best practices.

Prompt design fundamentals

Use specific instructions

Clear and detailed prompts provide the best results. If you have a specific output in mind, include that requirement in the prompt to increase the likelihood that you get the correct output.

Even when a prompt seems clear to the person who wrote it, the prompt might be underspecified. To customize model behavior, consider how your prompt could be interpreted and ensure that the instructions you give are specific and clear.

If you included the image of an airport board below as part of your prompt, asking the model to just "describe this image" could generate a general description. If you need the model to parse the time and city from the image, include that request directly in your prompt.

Prompt:
[Image: An airport timetable board with times and cities.]
Describe this image.

Model response:
The image shows an airport arrivals and departures board.

Updated prompt:
[Image: An airport timetable board with times and cities.]
Parse the time and city from the airport board shown in this image into a list.

Improved response:

  • 10:50 Moscow
  • 11:05 Edinburgh
  • 11:05 London
  • 11:10 Bucharest
  • 11:30 Kiev
  • 11:35 Dublin
  • 11:45 East Midlands
  • 12:15 Sofia
  • 12:30 London
  • 12:30 Newcastle
  • 12:40 St Petersburg
  • 12:40 London
  • 12:45 Manchester
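
One practical benefit of asking for a specific list format is that the response becomes easy to post-process. The following is a minimal sketch, assuming the model returns one "time city" pair per line as in the improved response above; the helper name `parse_timetable` and the sample text are illustrative and are not part of any Gemini SDK.

```python
import re

# Hypothetical helper (not part of any SDK): matches an optional bullet,
# a time such as 10:50, and then the city name.
LINE_PATTERN = re.compile(r"^\s*(?:[•*-]\s*)?(\d{1,2}:\d{2})\s+(.+?)\s*$")


def parse_timetable(response_text):
    """Return (time, city) tuples for every line that looks like 'HH:MM City'."""
    rows = []
    for line in response_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            rows.append((match.group(1), match.group(2)))
    return rows


# A few lines copied from the improved response above.
sample_response = """\
  • 10:50 Moscow
  • 11:45 East Midlands
  • 12:40 St Petersburg
"""

for time, city in parse_timetable(sample_response):
    print(time, city)
```

A general description such as the first response contains no lines in this shape, so the same helper returns an empty list for it.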

Add examples

The model can accept multiple inputs to use as examples to understand the output you want. Adding these examples can help the model identify the patterns and apply the relationship between the given images and responses to the new example. This is called "few-shot" learning.

In the following example, the initial output is a full sentence that also contains the country (Brazil). If the preferred output is only the city, not the country, then adding few-shot examples to your prompt can guide the model to respond in that format.

Prompt:
[Image: The Christ the Redeemer statue in Rio de Janeiro, Brazil.]
Determine the city along with the landmark.

Model response:
The landmark is the Christ the Redeemer statue in Rio de Janeiro, Brazil.

Updated prompt:
Determine the city along with the landmark.

[Image: The Colosseum in Rome, Italy.]
city: Rome, landmark: the Colosseum.

[Image: The Forbidden City in Beijing, China.]
city: Beijing, landmark: Forbidden City

[Image: The Christ the Redeemer statue in Rio de Janeiro, Brazil.]

Improved response:
city: Rio de Janeiro, landmark: Christ the Redeemer statue
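
The ordering of a few-shot multimodal prompt can be sketched in code. The following is a minimal illustration, assuming a generic list of "parts"; the dictionary keys, the function name `build_few_shot_prompt`, and the image file names are all hypothetical placeholders, not the request schema of the Gemini API.

```python
def build_few_shot_prompt(instruction, examples, query_image):
    """Return prompt parts in order: the instruction, each example image
    followed by its expected answer, and finally the new image to label."""
    parts = [{"text": instruction}]
    for image_ref, expected_answer in examples:
        parts.append({"image": image_ref})
        parts.append({"text": expected_answer})
    # The last image has no answer; the model is expected to complete the pattern.
    parts.append({"image": query_image})
    return parts


# File names below are placeholders for illustration only.
prompt_parts = build_few_shot_prompt(
    "Determine the city along with the landmark.",
    [
        ("colosseum.jpg", "city: Rome, landmark: the Colosseum."),
        ("forbidden_city.jpg", "city: Beijing, landmark: Forbidden City"),
    ],
    "christ_the_redeemer.jpg",
)

for part in prompt_parts:
    print(part)
```

Keeping each example image immediately before its answer mirrors the layout of the updated prompt, which is what lets the model infer the expected format for the final image.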

Split complex tasks into smaller tasks

For complex tasks like ones that require both visual understanding and reasoning, split the task into smaller, more straightforward steps or directly ask the model to think step by step in the prompt.

Prompt:
[Image: A picture of three rolls of toilet paper.]
When will I run out of toilet paper?

Model response:
Soon, you only have 3 rolls left.

Updated prompt:
[Image: A picture of three rolls of toilet paper.]
  1. First, count how many toilet paper rolls are in this picture.
  2. Then, determine how much toilet paper a typical person uses per day.
  3. Calculate how long these rolls of toilet paper will last.

Improved response:
  1. There are 3 rolls of toilet paper in this picture.
  2. A typical person uses about 20 sheets of toilet paper per day.
  3. If each roll contains 200 sheets, then each roll will last for about 10 days. Therefore, the 3 rolls will last for about a month.
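
When a response includes arithmetic like this, you can check it independently. The following is a small sketch that reuses the figures from the improved response (3 rolls, 200 sheets per roll, 20 sheets per day); those figures are the model's own estimates, and the function name is illustrative.

```python
def days_of_supply(rolls, sheets_per_roll, sheets_per_day):
    """Total sheets available divided by daily usage."""
    return rolls * sheets_per_roll / sheets_per_day


# Figures taken from the improved response above.
print(days_of_supply(rolls=3, sheets_per_roll=200, sheets_per_day=20))  # 30.0
```

The result of 30 days agrees with the model's conclusion of about a month.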

Math problems or other types of word problems can also be broken down into smaller tasks.

Prompt:
[Image: A picture of mathematical sequence b. b(1) = 15, and b(n) = b(n-1) * (-3).]
What is the 4th term in the sequence?

Response:
-135

The response from the model is incorrect. Try asking the model to parse the problem before answering the prompt.

Prompt:
[Image: A picture of mathematical sequence b. b(1) = 15, and b(n) = b(n-1) * (-3).]
Parse the formula in the image first.

Then based on the formula, answer "what's the 4th term in the sequence?".

Response:
$$b(n) = b(n-1) * (-3)$$

$$b(1) = 15$$

  1. Substitute n = 4 into the formula:

    $$b(4) = b(4-1) * (-3)$$

    $$b(4) = b(3) * (-3)$$

  2. We don't know the value of b(3), so substitute n = 3 into the formula:

    $$b(3) = b(3-1) * (-3)$$

    $$b(3) = b(2) * (-3)$$

  3. We don't know the value of b(2), so substitute n = 2 into the formula:

    $$b(2) = b(2-1) * (-3)$$

    $$b(2) = b(1) * (-3)$$

  4. We know the value of b(1) is 15, so substitute it into the formula:

    $$b(2) = 15 * (-3)$$

    $$b(2) = -45$$

  5. Now that we know the value of b(2), substitute it into the formula for b(3):

    $$b(3) = b(2) * (-3)$$

    $$b(3) = -45 * (-3)$$

    $$b(3) = 135$$

  6. Now that we know the value of b(3), substitute it into the formula for b(4):

    $$b(4) = b(3) * (-3)$$

    $$b(4) = 135 * (-3)$$

    $$b(4) = -405$$

Therefore, the 4th term in the sequence is -405.

To make the response even more streamlined, you can ask the model to think through the prompt step by step.

Updated prompt:
[Image: A picture of mathematical sequence b. b(1) = 15, and b(n) = b(n-1) * (-3).]
What is the 4th term in the sequence? Think step by step.

Improved response:
The first term is 15.

The second term is b(1) * (-3) = 15 * (-3) = -45.

The third term is b(2) * (-3) = -45 * (-3) = 135.

The fourth term is b(3) * (-3) = 135 * (-3) = -405.
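
To confirm which of the responses above is correct, you can evaluate the recurrence directly. The following is a minimal sketch that implements b(1) = 15 and b(n) = b(n-1) * (-3) exactly as stated in the image description; the function name `b` simply mirrors the notation in the prompt.

```python
def b(n):
    """Return the n-th term of the sequence b(1) = 15, b(n) = b(n-1) * (-3)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    term = 15
    for _ in range(n - 1):
        term *= -3
    return term


print([b(n) for n in range(1, 5)])  # [15, -45, 135, -405]
```

The fourth term is -405, which matches both step-by-step responses and confirms that the initial answer of -135 was wrong.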

Specify the output format

When a certain format is required, you can instruct the model to use that format for its output.

Prompt:
[Image: A picture of a spreadsheet in Excel.]
Parse the table in this image into Markdown format

Model response:
| Subject   | Status      | Time    | Start date | Due on     |
|-----------|-------------|---------|------------|------------|
| Math      | Done        | 1.5 hrs | 11/29/2023 | 12/1/2023  |
| Geography | Done        | 30 min  | 12/2/2023  | 12/4/2023  |
| Physics   | In progress | 2 hrs   | 12/3/2023  | 12/5/2023  |
| Biology   | Not started | 6 hrs   | 11/2/2016  | 12/6/2023  |

Prompt:
[Image: A picture of a meal with shrimp, clams, and other ingredients.]
Provide a list of all the following attributes:

ingredients, type of cuisine, vegetarian or not, in JSON format

Model response:
{
  "ingredients": [
    "rice",
    "shrimp",
    "clams",
    "mussels",
    "peas",
    "tomatoes",
    "onions",
    "garlic",
    "olive oil",
    "paprika",
    "salt",
    "pepper"
  ],
  "type of cuisine": "Spanish",
  "vegetarian": false
}
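
Requesting a machine-readable format is most useful when your code verifies what comes back. The following is a minimal sketch, assuming the response text is a bare JSON object with the three attributes requested in the prompt; the function name `parse_meal_response` and the shortened sample string are illustrative.

```python
import json

# The attributes requested in the prompt above.
REQUIRED_KEYS = {"ingredients", "type of cuisine", "vegetarian"}


def parse_meal_response(response_text):
    """Parse the model's JSON reply and check that it has the expected shape."""
    data = json.loads(response_text)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["vegetarian"], bool):
        raise ValueError("'vegetarian' should be true or false")
    return data


# Shortened version of the response above, for illustration.
sample = (
    '{"ingredients": ["rice", "shrimp", "clams"], '
    '"type of cuisine": "Spanish", "vegetarian": false}'
)
meal = parse_meal_response(sample)
print(meal["type of cuisine"], meal["vegetarian"])
```

If a check fails, one option is to retry with a prompt that restates the required keys more explicitly.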

Troubleshoot a multimodal prompt

Troubleshooting a prompt can involve improving the language of your prompt, directing the focus of your prompt, or tuning parameters. These different approaches can lead to improved results, especially after determining the root cause of the issue.

Focus on the relevant part of an image

To improve the response, point out which parts of the image are most relevant to the prompt.

Prompt:
[Image: A picture of the side of a box of diapers; the number 198 is written on the box.]
How many days will these diapers last a baby?

Model response:
The diapers will last for 198 days before they run out.

Updated prompt:
[Image: A picture of the side of a box of diapers; the number 198 is written on the box.]
How long will these diapers last before I run out?

Use the weight shown on the box to determine the child's age, and use the total number of diapers in the box.

Divide the total number by how many diapers the child goes through per day.

Improved response:
The box contains 198 diapers. The box states the diapers are for a child weighing 22-37 lbs.

The average child in this weight range is likely between 1 and 2 years old.

A 2-year-old typically goes through 6 diapers per day, therefore, the diapers will last around 33 days (198/6).

Tailor the model's response to input

Ask the model to describe the images before performing the task in the prompt.

Prompt:
[Image: A picture of a person packing a suitcase.]
[Image: A picture of an empty hallway in a shopping mall.]
[Image: A picture of three rolls of toilet paper.]
[Image: A picture of dark clouds breaking over the sea.]
What is in common between these images?

Model response:
All four images are examples of landscape photography.

Updated prompt:
[Image: A picture of a person packing a suitcase.]
[Image: A picture of an empty hallway in a shopping mall.]
[Image: A picture of three rolls of toilet paper.]
[Image: A picture of dark clouds breaking over the sea.]
First, describe what's in each image in detail. What's in common between these images?

Improved response:

The first image shows a person packing a suitcase. The second image shows an empty hallway in a shopping mall. The third image shows a roll of toilet paper. The fourth image shows a dark and stormy sea.

The commonality between these images is that they are all examples of liminal spaces. Liminal spaces are places or situations that are in between two other places or situations. They are often characterized by a sense of uncertainty or ambiguity.

The first image shows the person in between being at home and being on a trip. The second image shows the hallway in between two stores. The third image is of toilet paper, which can represent a bathroom. A bathroom is a transitional space between private and public realms. It's a place where we prepare ourselves to enter or leave the outside world.

The fourth image shows the sea in between being calm and being stormy.

Liminal spaces can be both exciting and unsettling. They can be a place of new beginning or a place of endings. They can be a place of reflection or a place of action.

You can explicitly ask the model to refer to what's in the image.

Updated prompt:
[Image: A picture of a person packing a suitcase.]
[Image: A picture of an empty hallway in a shopping mall.]
[Image: A picture of three rolls of toilet paper.]
[Image: A picture of dark clouds breaking over the sea.]
What is in common between these images? Refer to what's in the images in your response.

Improved response:

All of the images are related to travel. The first image shows a person packing a suitcase, the second image shows an empty hallway in a shopping mall, the third image shows a stack of toilet paper, and the fourth image shows a dark and stormy sea.

These images are all related to travel because they are all things that people need or use when they travel. The first image shows a person packing a suitcase, which is something that people do when they are going on a trip. The second image shows an empty hallway in a shopping mall, which is a place where people often go to buy things for their trips. The third image shows a stack of toilet paper, which is something that people often use when they are traveling. The fourth image shows a dark and stormy sea, which is something that people often encounter when they are traveling.

The same tips work for creative output. Explicitly ask the model to include aspects from the image; this can improve the output.

Prompt:
[Image: A picture of a brown rabbit.]
Can you write me a poem about this image?

Model response:
Little bunny, so soft and brown,
Hopping through the field, up and down.
Your ears so long, your tail so small,
You're the cutest creature of them all.

You nibble on the grass, so green and sweet,
And hop away when you see me.
But I'll always be your friend,
Little bunny, until the very end.

Updated prompt:
[Image: A picture of a brown rabbit.]
Can you write me a descriptive and dramatic poem about this image and include the location?

Improved response:
Little bunny on the green,
Ears so long, so keen.
What do you see out there,
In the distance, where you stare?

Is it the ocean blue,
Sparkling in the sun's hue?
Or the waves that crash and roar,
A rhythm that you adore?

Little bunny, tell me true,
What captures your attention so?
Is it the salty breeze,
Or the seagulls' cries that please?

Perhaps it's the vastness of it all,
A world so big and small.
A place where dreams can roam,
And adventures can be sown.

Little bunny, I wonder what you dream,
As you sit on the grass, so serene.
Do you yearn to explore the deep,
Or stay on land, where you can leap?

Whatever it may be, little bunny,
Keep that spark of wonder burning brightly.
For in your dreams and aspirations,
Lies a world waiting for your creation.

Determine how the prompt failed

A prompt can fail because the model did not understand the image at all, or because it did not perform the correct reasoning steps afterward. To disambiguate between those reasons, ask the model to describe what's in the image.

In this example, if the model responds with a snack that seems surprising when paired with tea (like popcorn), first determine whether the model recognized that the image contains tea.

Prompt:
[Image: A picture of a tea set.]
What's a snack I can make in 1 minute that would go well with this?

Prompt for troubleshooting:
[Image: A picture of a tea set.]
Describe what's in this image.

Asking the model to explain its reasoning can help narrow down which part of the reasoning broke down, if any.

Prompt:
[Image: A picture of a tea set.]
What's a snack I can make in 1 minute that would go well with this?

Prompt for troubleshooting:
[Image: A picture of a tea set.]
What's a snack I can make in 1 minute that would go well with this? Please explain why.
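
The two troubleshooting prompts above can be generated from the original question so that you can run all three against the same image and compare the results. The following is a minimal sketch; the function name `troubleshooting_prompts` and the dictionary keys are hypothetical, and sending the prompts to a model is left out.

```python
def troubleshooting_prompts(question):
    """Return the original question plus two diagnostic variants."""
    return {
        # The prompt that produced the surprising answer.
        "original": question,
        # Checks whether the model understood the image at all.
        "describe": "Describe what's in this image.",
        # Exposes the reasoning steps behind the answer.
        "explain": f"{question} Please explain why.",
    }


prompts = troubleshooting_prompts(
    "What's a snack I can make in 1 minute that would go well with this?"
)
for name, text in prompts.items():
    print(f"{name}: {text}")
```

If the "describe" variant does not mention tea, the failure is likely in image understanding; if it does, the "explain" variant can help show where the reasoning went wrong.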

Tune the sampling parameters

In each request, you send not only the multimodal prompt but a set of sampling parameters to the model. The model can generate different results for different parameter values. Experiment with the different parameters to get the best values for the task. The most commonly adjusted parameters are the following:

Temperature

Temperature is used for sampling during response generation, which occurs when top-P and top-K are applied. Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that require a more deterministic and less open-ended or creative response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 is deterministic, meaning that the highest probability response is most likely to be selected.

For most use cases, start with a temperature of 0.4. If you need more creative results, increase the temperature. If you observe clear hallucinations, reduce the temperature.
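
To build intuition for how temperature changes token selection, the following is a minimal sketch that assumes the common softmax-with-temperature formulation. This page does not describe how Gemini models implement sampling internally, so treat it as an illustration only; the logits are made-up values.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # In this sketch, a temperature of 0 would mean always picking the top score.
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for temperature in (0.2, 0.4, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

In this sketch, lower temperatures concentrate probability on the highest-scoring token, and higher temperatures spread it more evenly across the candidates.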

Top-P

Top-P changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the top-P value. For example, if tokens A, B, and C have probabilities of 0.6, 0.3, and 0.1 and the top-P value is 0.9, then the model selects either A or B as the next token by using temperature and excludes C as a candidate.

Specify a lower value for less random responses and a higher value for more random responses. The default value of top-P is 1.0.
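
The A, B, and C example above can be reproduced with a short filter. The following is a minimal sketch of the candidate selection step only, written from the description on this page; the function name `top_p_candidates` and the `tolerance` parameter are assumptions introduced here, and the tolerance exists because 0.6 + 0.3 is slightly less than 0.9 in floating-point arithmetic.

```python
def top_p_candidates(token_probs, top_p, tolerance=1e-9):
    """Keep the most probable tokens until their cumulative probability
    reaches top_p; the remaining tokens are excluded from sampling."""
    ranked = sorted(token_probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= top_p - tolerance:
            break
    return kept


print(top_p_candidates({"A": 0.6, "B": 0.3, "C": 0.1}, top_p=0.9))  # ['A', 'B']
```

With a top-P value of 1.0, all three tokens remain candidates, which is why higher values allow more varied responses.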


Last updated 2026-02-19 UTC.