Use reservations with batch inference

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

This document explains how to use Compute Engine reservations to gain a high level of assurance that your batch inference jobs have the necessary virtual machine (VM) resources to run.

Reservations are a Compute Engine feature. They help ensure that you have the resources available to create VMs with the same hardware (memory and vCPUs) and optional resources (GPUs, TPUs, and Local SSD disks) whenever you need them.

When you create a reservation, Compute Engine verifies that the requested capacity is available in the specified zone. If it is, Compute Engine reserves the resources and creates the reservation, and the following happens:

  • You can immediately consume the reserved resources, and they remain available until you delete the reservation.
  • You're charged for the reserved resources at the same on-demand rate as running VMs, including any applicable discounts, until the reservation is deleted. A VM consuming a reservation doesn't incur separate charges; you're charged only for resources outside of the reservation, such as disks or IP addresses. To learn more, see pricing for reservations.

Limitations and requirements

When using Compute Engine reservations with Vertex AI, consider the following limitations and requirements:

  • Vertex AI can only use reservations for CPUs, GPU VMs, or TPUs (Preview).
  • Vertex AI can't consume reservations of VMs that have Local SSD disks manually attached.
  • Using Compute Engine reservations with Vertex AI is only supported for Vertex AI serverless training, inference, and Vertex AI Workbench (Preview).
  • A reservation's VM properties must exactly match those of your Vertex AI workload for the workload to consume the reservation. For example, if a reservation specifies an a2-ultragpu-8g machine type, then the Vertex AI workload can only consume the reservation if it also uses an a2-ultragpu-8g machine type. See Requirements.
  • To consume a shared reservation of GPU VMs or TPUs, you must use its owner project or a consumer project with which the reservation is shared. See How shared reservations work.
  • To consume a SPECIFIC_RESERVATION reservation, grant the Compute Viewer IAM role to the Vertex AI service account in the project that owns the reservation (service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com, where PROJECT_NUMBER is the project number of the project that consumes the reservation); see the example command after this list.
  • The following services and capabilities aren't supported when using Compute Engine reservations with Vertex AI batch inference:

    • Federal Risk and Authorization Management Program (FedRAMP) compliance
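
For example, a minimal gcloud sketch of the Compute Viewer grant described above. RESERVATION_OWNER_PROJECT_ID is an illustrative placeholder for the project that owns the reservation, and PROJECT_NUMBER is the number of the consuming project:

gcloud projects add-iam-policy-binding RESERVATION_OWNER_PROJECT_ID \
--member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com" \
--role="roles/compute.viewer"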

Billing

When using Compute Engine reservations, you're billed for the following:

  • Compute Engine pricing for the Compute Engine resources, including any applicable committed use discounts (CUDs). See Compute Engine pricing.
  • Vertex AI batch inference management fees in addition to your infrastructure usage. See Prediction pricing.

Before you begin

Allow a reservation to be consumed

Before consuming a reservation of CPUs, GPU VMs, or TPUs, you must set its sharing policy to allow Vertex AI to consume the reservation. To do so, use one of the following methods:

Allow consumption while creating a reservation

When creating a single-project or shared reservation of GPU VMs, you can allow Vertex AI to consume the reservation as follows:

  • If you're using the Google Cloud console, then, in the Google Cloud services section, select Share reservation.
  • If you're using the Google Cloud CLI, then include the --reservation-sharing-policy flag set to ALLOW_ALL (see the sketch after this list).
  • If you're using the REST API, then, in the request body, include the serviceShareType field set to ALLOW_ALL.
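
For example, a minimal gcloud sketch that creates a single-project reservation that Vertex AI can consume. The reservation name, zone, machine type, and VM count are illustrative values:

gcloud compute reservations create my-reservation \
--zone=us-central1-a \
--machine-type=a2-ultragpu-8g \
--vm-count=1 \
--reservation-sharing-policy=ALLOW_ALL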

Allow consumption in an existing reservation

You can modify a reservation of GPU VMs or TPUs that was auto-created for a future reservation request only after the reservation's start time.

To allow Vertex AI to consume an existing reservation, update the reservation's sharing policy to ALLOW_ALL.
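
For example, a minimal gcloud sketch, assuming your gcloud CLI version supports the --reservation-sharing-policy flag on the update command:

gcloud compute reservations update RESERVATION_NAME \
--zone=ZONE \
--reservation-sharing-policy=ALLOW_ALL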

Verify that a reservation is consumed

To verify that the reservation is being consumed, see Verify reservations consumption in the Compute Engine documentation.
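
As a quick check, you can also describe the reservation and inspect its in-use count. This is a sketch that assumes the reservation reports usage through the specificReservation.inUseCount field:

gcloud compute reservations describe RESERVATION_NAME \
--zone=ZONE \
--format="value(specificReservation.inUseCount)"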

Get batch inferences by using a reservation

To create a batch inference request that consumes a Compute Engine reservation of GPU VMs, you can use the REST API and choose either Cloud Storage or BigQuery for the source and destination.

Cloud Storage

Before using any of the request data, make the following replacements:

  • LOCATION_ID: The region where the model is stored and the batch prediction job is executed. For example, us-central1.

  • PROJECT_ID: The project where the reservation was created. To consume a shared reservation from another project, you must share the reservation with that project. For more information, see Modify the consumer projects in a shared reservation.

  • BATCH_JOB_NAME: A display name for the batch prediction job.

  • MODEL_ID: The ID for the model to use for making predictions.

  • INPUT_FORMAT: The format of your input data: jsonl, csv, tf-record, tf-record-gzip, or file-list.

  • INPUT_URI: The Cloud Storage URI of your input data. May contain wildcards.

  • OUTPUT_DIRECTORY: The Cloud Storage URI of a directory where you want Vertex AI to save output.

  • MACHINE_TYPE: The machine resources to be used for this batch prediction job.

  • ACCELERATOR_TYPE: The type of accelerator to attach to the machine. For more information about the type of GPU that each machine type supports, see GPUs for compute workloads.

  • ACCELERATOR_COUNT: The number of accelerators to attach to the machine.

  • RESERVATION_AFFINITY_TYPE: Must be ANY, SPECIFIC_RESERVATION, or NONE.

    • ANY means that the VMs of your job can automatically consume any reservation with matching properties.
    • SPECIFIC_RESERVATION means that the VMs of your job can only consume a reservation that they specifically target by name.
    • NONE means that the VMs of your job can't consume any reservation. Specifying NONE has the same effect as omitting a reservation affinity specification.
  • BATCH_SIZE: The number of instances to send in each prediction request; the default is 64. Increasing the batch size can lead to higher throughput, but it can also cause request timeouts.

  • STARTING_REPLICA_COUNT: The number of nodes for this batch prediction job.

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/batchPredictionJobs

Request JSON body:

{  "displayName": "BATCH_JOB_NAME",  "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",  "inputConfig": {    "instancesFormat": "INPUT_FORMAT",    "gcsSource": {      "uris": ["INPUT_URI"],    },  },  "outputConfig": {    "predictionsFormat": "jsonl",    "gcsDestination": {      "outputUriPrefix": "OUTPUT_DIRECTORY",    },  },  "dedicatedResources" : {    "machineSpec" : {      "machineType":MACHINE_TYPE,      "acceleratorType": "ACCELERATOR_TYPE",      "acceleratorCount":ACCELERATOR_COUNT,      "reservationAffinity": {        "reservationAffinityType": "RESERVATION_AFFINITY_TYPE",        "key": "compute.googleapis.com/reservation-name",        "values": [          "projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME"        ]      }    },    "startingReplicaCount":STARTING_REPLICA_COUNT  },  "manualBatchTuningParameters": {    "batch_size":BATCH_SIZE,  }}

To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/batchPredictionJobs"

PowerShell

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/batchPredictionJobs" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{  "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/batchPredictionJobs/BATCH_JOB_ID",  "displayName": "BATCH_JOB_NAME 202005291958",  "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",  "inputConfig": {    "instancesFormat": "jsonl",    "gcsSource": {      "uris": [        "INPUT_URI"      ]    }  },  "outputConfig": {    "predictionsFormat": "jsonl",    "gcsDestination": {      "outputUriPrefix": "OUTPUT_DIRECTORY"    }  },  "state": "JOB_STATE_PENDING",  "createTime": "2020-05-30T02:58:44.341643Z",  "updateTime": "2020-05-30T02:58:44.341643Z",}

BigQuery

Before using any of the request data, make the following replacements:

  • LOCATION_ID: The region where the model is stored and the batch prediction job is executed. For example, us-central1.

  • PROJECT_ID: The project where the reservation was created. To consume a shared reservation from another project, you must share the reservation with that project. For more information, see Modify the consumer projects in a shared reservation.

  • BATCH_JOB_NAME: A display name for the batch prediction job.

  • MODEL_ID: The ID for the model to use for making predictions.

  • INPUT_PROJECT_ID: The ID of the Google Cloud project where you want to get the data from.

  • INPUT_DATASET_NAME: The name of the BigQuery dataset where you want to get the data from.

  • INPUT_TABLE_NAME: The name of the BigQuery table where you want to get the data from.

  • OUTPUT_PROJECT_ID: The ID of the Google Cloud project where you want to save the output.

  • OUTPUT_DATASET_NAME: The name of the destination BigQuery dataset where you want to save the output.

  • OUTPUT_TABLE_NAME: The name of the BigQuery destination table where you want to save the output.

  • MACHINE_TYPE: The machine resources to be used for this batch prediction job.

  • ACCELERATOR_TYPE: The type of accelerator to attach to the machine. For more information about the type of GPU that each machine type supports, see GPUs for compute workloads.

  • ACCELERATOR_COUNT: The number of accelerators to attach to the machine.

  • RESERVATION_AFFINITY_TYPE: Must be ANY, SPECIFIC_RESERVATION, or NONE.

    • ANY means that the VMs of your job can automatically consume any reservation with matching properties.
    • SPECIFIC_RESERVATION means that the VMs of your job can only consume a reservation that they specifically target by name.
    • NONE means that the VMs of your job can't consume any reservation. Specifying NONE has the same effect as omitting a reservation affinity specification.
  • BATCH_SIZE: The number of instances to send in each prediction request; the default is 64. Increasing the batch size can lead to higher throughput, but it can also cause request timeouts.

  • STARTING_REPLICA_COUNT: The number of nodes for this batch prediction job.

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/batchPredictionJobs

Request JSON body:

{  "displayName": "BATCH_JOB_NAME",  "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",  "inputConfig": {    "instancesFormat": "bigquery",    "bigquerySource": {      "inputUri": "bq://INPUT_PROJECT_ID.INPUT_DATASET_NAME.INPUT_TABLE_NAME"    },  },  "outputConfig": {    "predictionsFormat":"bigquery",    "bigqueryDestination":{      "outputUri": "bq://OUTPUT_PROJECT_ID.OUTPUT_DATASET_NAME.OUTPUT_TABLE_NAME"    }  },  "dedicatedResources" : {    "machineSpec" : {      "machineType":MACHINE_TYPE,      "acceleratorType": "ACCELERATOR_TYPE",      "acceleratorCount":ACCELERATOR_COUNT,      "reservationAffinity": {        "reservationAffinityType": "RESERVATION_AFFINITY_TYPE",        "key": "compute.googleapis.com/reservation-name",        "values": [          "projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME"        ]      }    },    "startingReplicaCount":STARTING_REPLICA_COUNT  },  "manualBatchTuningParameters": {    "batch_size":BATCH_SIZE,  }}

To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/batchPredictionJobs"

PowerShell

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/batchPredictionJobs" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{  "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/batchPredictionJobs/BATCH_JOB_ID",  "displayName": "BATCH_JOB_NAME 202005291958",  "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",  "inputConfig": {    "instancesFormat": "jsonl",    "bigquerySource": {      "uris": [        "INPUT_URI"      ]    }  },  "outputConfig": {    "predictionsFormat": "jsonl",    "bigqueryDestination": {      "outputUri": "OUTPUT_URI"    }  },  "state": "JOB_STATE_PENDING",  "createTime": "2020-05-30T02:58:44.341643Z",  "updateTime": "2020-05-30T02:58:44.341643Z",}

Retrieve batch inference results

When a batch inference task is complete, the output of the inference is stored in the Cloud Storage bucket or BigQuery location that you specified in your request.
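
For example, you can list the Cloud Storage output or preview the first rows of the BigQuery table with standard tooling. This is a minimal sketch; substitute the placeholder values from your request:

# For a Cloud Storage destination:
gcloud storage ls OUTPUT_DIRECTORY

# For a BigQuery destination:
bq head -n 10 OUTPUT_PROJECT_ID:OUTPUT_DATASET_NAME.OUTPUT_TABLE_NAME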

What's next
