Prepare your evaluation dataset
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
This page describes how to prepare your dataset for the Gen AI evaluation service.
Overview
The Gen AI evaluation service automatically detects and handles several common data formats. This means you can often use your data as-is without needing to perform manual conversions.
The fields you need to provide in your dataset depend on your goal:
| Goal | Required data | SDK workflow |
|---|---|---|
| Generate new responses and then evaluate them | prompt | run_inference() → evaluate() |
| Evaluate existing responses | prompt and response | evaluate() |
| Generate new agent run results and then evaluate them | prompt | run_inference() → evaluate() |
| Evaluate existing agent responses and intermediate events | prompt, response, and intermediate_events | evaluate() |
When running client.evals.evaluate() or client.evals.create_evaluation_run(), the Gen AI evaluation service automatically looks for the following common fields in your dataset:
- prompt: (Required) The input to the model that you want to evaluate. For best results, provide example prompts that represent the types of inputs that your models process in production.
- response: (Required) The output generated by the model or application that is being evaluated.
- reference: (Optional) The ground truth or "golden" answer that you can compare the model's response against. This field is often required for computation-based metrics like bleu and rouge.
- conversation_history: (Optional) A list of preceding turns in a multi-turn conversation. The Gen AI evaluation service automatically extracts this field from supported formats. For more information, see Handling multi-turn conversations.
- session_inputs: (Optional) Input to initialize a session to run an agent. This field is only optional for the run_inference() → evaluate() workflow.
- intermediate_events: (Optional) Agent traces of a single turn in an agent run, including function calls, function responses, and intermediate model responses. This field is not required for the run_inference() → evaluate() workflow.
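For example, if you already have model responses, you can supply these fields directly and skip run_inference(). The following is a minimal sketch that assumes a configured client and the types import used in the examples later on this page; the data, model output, and metric choice are illustrative:

```python
import pandas as pd

# Each column name matches a field the service detects automatically.
eval_df = pd.DataFrame({
    "prompt": ["What is the capital of France?"],
    "response": ["The capital of France is Paris."],
    "reference": ["Paris"],  # optional ground truth for reference-based metrics
})

# Because responses already exist, evaluate() is called directly,
# with no preceding run_inference() step.
eval_result = client.evals.evaluate(
    dataset=eval_df,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY],
)
eval_result.show()
```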
Supported data formats
The Gen AI evaluation service supports the following formats:
Pandas DataFrame
For straightforward evaluations, you can use a pandas.DataFrame. The Gen AI evaluation service looks for common column names like prompt, response, and reference. This format is fully backward-compatible.
```python
import pandas as pd

# Example DataFrame with prompts and ground truth references
prompts_df = pd.DataFrame({
    "prompt": [
        "What is the capital of France?",
        "Who wrote 'Hamlet'?",
    ],
    "reference": [
        "Paris",
        "William Shakespeare",
    ]
})

# You can use this DataFrame directly with run_inference or evaluate
eval_dataset = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=prompts_df,
)

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY],
)
eval_result.show()
```

Gemini batch prediction format
You can directly use the output of a Vertex AI batch prediction job, which is typically stored as JSONL files in Cloud Storage where each line contains a request and a response object. The Gen AI evaluation service parses this structure automatically, providing integration with other Vertex AI services.
The following is an example of a single line in a JSONL file:
{"request":{"contents":[{"role":"user","parts":[{"text":"Why is the sky blue?"}]}]},"response":{"candidates":[{"content":{"role":"model","parts":[{"text":"The sky appears blue to the human eye as a result of a phenomenon known as Rayleigh scattering."}]}}]}}You can then evaluate pre-generated responses from a batch job directly:
```python
# Cloud Storage path to your batch prediction output file
batch_job_output_uri = "gs://path/to/your/batch_output.jsonl"

# Evaluate the pre-generated responses directly
eval_result = client.evals.evaluate(
    dataset=batch_job_output_uri,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY],
)
eval_result.show()
```

OpenAI Chat Completion format
To evaluate or compare third-party models, such as those from OpenAI and Anthropic, the Gen AI evaluation service supports the OpenAI Chat Completion format. You can supply a dataset where each row is a JSON object structured like an OpenAI API request. The Gen AI evaluation service automatically detects this format.
The following is an example of a single line in this format:
{"request":{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What's the capital of France?"}],"model":"gpt-4o"}}You can use this data to generate responses from a third-party model and evaluate the responses:
```python
# Ensure your third-party API key is set
# e.g., os.environ['OPENAI_API_KEY'] = 'Your API Key'
openai_request_uri = "gs://path/to/your/openai_requests.jsonl"

# Generate responses using a LiteLLM-supported model string
openai_responses = client.evals.run_inference(
    model="gpt-4o",  # LiteLLM compatible model string
    src=openai_request_uri,
)

# The resulting openai_responses object can then be evaluated
eval_result = client.evals.evaluate(
    dataset=openai_responses,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY],
)
eval_result.show()
```

Handling multi-turn conversations
The Gen AI evaluation service automatically parses multi-turn conversation data from supported formats. When your input data includes a history of exchanges (such as within the request.contents field in the Gemini format, or request.messages in the OpenAI format), the Gen AI evaluation service identifies the previous turns and processes them as conversation_history.
This means you don't need to manually separate the current prompt from the prior conversation, since the evaluation metrics can use the conversation history to understand the context of the model's response.
Consider the following example of a multi-turn conversation in Gemini format:
{"request":{"contents":[{"role":"user","parts":[{"text":"I'm planning a trip to Paris."}]},{"role":"model","parts":[{"text":"That sounds wonderful! What time of year are you going?"}]},{"role":"user","parts":[{"text":"I'm thinking next spring. What are some must-see sights?"}]}]},"response":{"candidates":[{"content":{"role":"model","parts":[{"text":"For spring in Paris, you should definitely visit the Eiffel Tower, the Louvre Museum, and wander through Montmartre."}]}}]}}The multi-turn conversation is automatically parsed as follows:
- prompt: The last user message is identified as the current prompt ({"role": "user", "parts": [{"text": "I'm thinking next spring. What are some must-see sights?"}]}).
- conversation_history: The preceding messages are automatically extracted and made available as the conversation history ([{"role": "user", "parts": [{"text": "I'm planning a trip to Paris."}]}, {"role": "model", "parts": [{"text": "That sounds wonderful! What time of year are you going?"}]}]).
- response: The model's reply is taken from the response field ({"role": "model", "parts": [{"text": "For spring in Paris..."}]}).
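Because this parsing is automatic, a JSONL file of multi-turn records in the format above can be passed to evaluate() without any preprocessing. The following is a minimal sketch; the Cloud Storage path is a placeholder and the metric choice is illustrative:

```python
# Cloud Storage path to a JSONL file of multi-turn records like the one above
multi_turn_uri = "gs://path/to/your/multi_turn_conversations.jsonl"

# conversation_history is extracted automatically for each record,
# so metrics can judge the response in the context of the prior turns.
eval_result = client.evals.evaluate(
    dataset=multi_turn_uri,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY],
)
eval_result.show()
```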
What's next