Use managed datasets

This page shows you how to use Vertex AI managed datasets to train your custom models. Managed datasets offer the following benefits:

  • Manage your datasets in a central location.
  • Create labels and multiple annotation sets.
  • Create tasks for human labeling using integrated data labeling.
  • Track lineage to models for governance and iterative development.
  • Compare model performance by training AutoML and custom models using the same datasets.
  • Generate data statistics and visualizations.
  • Automatically split data into training, test, and validation sets.

Before you begin

Before you can use a managed dataset in your training application, you must create your dataset. You must create the dataset and the training pipeline that you use for training in the same region. You must use a region where Dataset resources are available.

Access a dataset from your training application

When you create a serverless training pipeline, you can specify that your training application uses a Vertex AI dataset.

At runtime, Vertex AI passes metadata about your dataset to your training application by setting the following environment variables in your training container.

  • AIP_DATA_FORMAT: The format that your dataset is exported in. Possible values include: jsonl, csv, or bigquery.
  • AIP_TRAINING_DATA_URI: The BigQuery URI of your training data or the Cloud Storage URI of your training data file.
  • AIP_VALIDATION_DATA_URI: The BigQuery URI for your validation data or the Cloud Storage URI of your validation data file.
  • AIP_TEST_DATA_URI: The BigQuery URI for your test data or the Cloud Storage URI of your test data file.
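
As a minimal sketch of how a training application might read the environment variables listed above, the following Python snippet collects them into a dictionary. The helper name read_dataset_env and the choice to map unset variables to None are illustrative assumptions, not part of Vertex AI.

```python
import os

# Names of the dataset-related environment variables described above.
DATASET_ENV_VARS = (
    "AIP_DATA_FORMAT",
    "AIP_TRAINING_DATA_URI",
    "AIP_VALIDATION_DATA_URI",
    "AIP_TEST_DATA_URI",
)


def read_dataset_env(environ=None):
    """Return the dataset variables as a dict; unset variables map to None.

    Hypothetical helper for illustration only, not a Vertex AI API.
    """
    if environ is None:
        environ = os.environ
    return {name: environ.get(name) for name in DATASET_ENV_VARS}
```

Passing a plain dictionary instead of os.environ makes the helper easy to exercise outside a training container.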

If the AIP_DATA_FORMAT of your dataset is jsonl or csv, the data URI values refer to Cloud Storage URIs, like gs://bucket_name/path/training-*. To keep the size of each data file relatively small, Vertex AI splits your dataset into multiple files. Because your training, validation, or test data may be split into multiple files, the URIs are provided in wildcard format.

Learn more about downloading objects using the Cloud Storage code samples.
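
Because the Cloud Storage URIs are wildcard patterns, a training application typically has to expand them against the objects in the bucket. The following sketch shows one way to do that with only the Python standard library: it splits a URI into bucket and pattern, then filters a list of object names. The helper names and the object names in the usage note are hypothetical, and listing the bucket itself (for example, with a Cloud Storage client library) is left out.

```python
from fnmatch import fnmatchcase


def split_gcs_uri(uri):
    """Split 'gs://bucket/pattern' into (bucket, pattern)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    bucket, _, pattern = uri[len(prefix):].partition("/")
    return bucket, pattern


def matching_objects(uri, object_names):
    """Return the object names that match the wildcard part of the URI.

    Uses shell-style matching, where '*' matches any run of characters.
    """
    _, pattern = split_gcs_uri(uri)
    return [name for name in object_names if fnmatchcase(name, pattern)]
```

For example, matching_objects("gs://bucket_name/path/training-*", names) keeps only the names that start with path/training-, whatever suffix the export files use.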

If your AIP_DATA_FORMAT is bigquery, the data URI values refer to BigQuery URIs, like bq://project.dataset.table.

Learn more about paging through BigQuery data.
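
BigQuery client libraries generally expect the project, dataset, and table as separate identifiers rather than a bq:// URI. The sketch below splits a URI of the form shown above into those three parts. The helper name parse_bq_uri is hypothetical, and the code assumes that the dataset and table names contain no dots.

```python
def parse_bq_uri(uri):
    """Split 'bq://project.dataset.table' into (project, dataset, table)."""
    prefix = "bq://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a BigQuery URI: {uri!r}")
    # Split from the right so that only the last two dots act as separators.
    parts = uri[len(prefix):].rsplit(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected bq://project.dataset.table, got {uri!r}")
    project, dataset, table = parts
    return project, dataset, table
```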

Dataset format

Use the following sections to learn more about how Vertex AI formats your data when passing a dataset to your training application.

Image datasets

Image datasets are passed to your training application in JSON Lines format. See the section for your dataset's objective to learn more about how Vertex AI formats your dataset.

Single-label classification

Vertex AI uses the following publicly accessible schema when exporting a single-label image classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_single_label_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.

{
  "imageGcsUri": "gs://bucket/filename.ext",
  "classificationAnnotation": {
    "displayName": "LABEL",
    "annotationResourceLabels": {
      "aiplatform.googleapis.com/annotation_set_name": "displayName",
      "env": "prod"
    }
  },
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training/test/validation"
  }
}

Field notes:

  • imageGcsUri: The Cloud Storage URI of this image.
  • annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
  • dataItemResourceLabels: Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.

Example JSON Lines

{"imageGcsUri": "gs://bucket/filename1.jpeg", "classificationAnnotation": {"displayName": "daisy"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif", "classificationAnnotation": {"displayName": "dandelion"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png", "classificationAnnotation": {"displayName": "roses"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.bmp", "classificationAnnotation": {"displayName": "sunflowers"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename5.tiff", "classificationAnnotation": {"displayName": "tulips"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...
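
To illustrate how a training application might consume single-label image classification records, the following sketch parses JSON Lines text and counts items per (ml_use, label) pair. It assumes one JSON object per line, as in the example above. The function name count_labels_by_use is hypothetical, and the sample lines are copied from the example.

```python
import json
from collections import Counter

ML_USE_KEY = "aiplatform.googleapis.com/ml_use"


def count_labels_by_use(lines):
    """Count (ml_use, label) pairs across single-label classification lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines between records.
        item = json.loads(line)
        label = item["classificationAnnotation"]["displayName"]
        ml_use = item.get("dataItemResourceLabels", {}).get(ML_USE_KEY)
        counts[(ml_use, label)] += 1
    return counts


sample_lines = [
    '{"imageGcsUri": "gs://bucket/filename1.jpeg", "classificationAnnotation": {"displayName": "daisy"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}',
    '{"imageGcsUri": "gs://bucket/filename2.gif", "classificationAnnotation": {"displayName": "dandelion"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}',
]
label_counts = count_labels_by_use(sample_lines)
```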

Multi-label classification

Vertex AI uses the following publicly accessible schema when exporting a multi-label image classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_multi_label_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.

{
  "imageGcsUri": "gs://bucket/filename.ext",
  "classificationAnnotations": [
    {
      "displayName": "LABEL1",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name": "displayName",
        "label_type": "flower_type"
      }
    },
    {
      "displayName": "LABEL2",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name": "displayName",
        "label_type": "image_shot_type"
      }
    }
  ],
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training/test/validation"
  }
}

Field notes:

  • imageGcsUri: The Cloud Storage URI of this image.
  • annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
  • dataItemResourceLabels: Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.

Example JSON Lines

{"imageGcsUri": "gs://bucket/filename1.jpeg", "classificationAnnotations": [{"displayName": "daisy"}, {"displayName": "full_shot"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif", "classificationAnnotations": [{"displayName": "dandelion"}, {"displayName": "medium_shot"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png", "classificationAnnotations": [{"displayName": "roses"}, {"displayName": "extreme_closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.bmp", "classificationAnnotations": [{"displayName": "sunflowers"}, {"displayName": "closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename5.tiff", "classificationAnnotations": [{"displayName": "tulips"}, {"displayName": "extreme_closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...
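
Multi-label records carry a list of annotations, which training code often converts into a multi-hot target vector. The following is a sketch of that conversion under the assumption that the label vocabulary is known ahead of time. The function names and the vocabulary are hypothetical; the sample line is taken from the example above.

```python
import json


def labels_of(item):
    """Return the sorted, de-duplicated label names of one data item."""
    annotations = item.get("classificationAnnotations", [])
    return sorted({annotation["displayName"] for annotation in annotations})


def multi_hot(item, vocabulary):
    """Encode the item's labels as a 0/1 list aligned with `vocabulary`."""
    present = set(labels_of(item))
    return [1 if label in present else 0 for label in vocabulary]


line = (
    '{"imageGcsUri": "gs://bucket/filename1.jpeg", '
    '"classificationAnnotations": [{"displayName": "daisy"}, {"displayName": "full_shot"}], '
    '"dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}'
)
item = json.loads(line)
vocabulary = ["closeup", "daisy", "full_shot", "tulips"]  # Hypothetical label order.
vector = multi_hot(item, vocabulary)
```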

Object detection

Vertex AI uses the following publicly accessible schema when exporting an object detection dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_bounding_box_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.

{
  "imageGcsUri": "gs://bucket/filename.ext",
  "boundingBoxAnnotations": [
    {
      "displayName": "OBJECT1_LABEL",
      "xMin": "X_MIN",
      "yMin": "Y_MIN",
      "xMax": "X_MAX",
      "yMax": "Y_MAX",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name": "displayName",
        "env": "prod"
      }
    },
    {
      "displayName": "OBJECT2_LABEL",
      "xMin": "X_MIN",
      "yMin": "Y_MIN",
      "xMax": "X_MAX",
      "yMax": "Y_MAX"
    }
  ],
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "test/train/validation"
  }
}

Field notes:

  • imageGcsUri: The Cloud Storage URI of this image.
  • annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
  • dataItemResourceLabels: Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.

Example JSON Lines

{"imageGcsUri": "gs://bucket/filename1.jpeg", "boundingBoxAnnotations": [{"displayName": "Tomato", "xMin": "0.3", "yMin": "0.3", "xMax": "0.7", "yMax": "0.6"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif", "boundingBoxAnnotations": [{"displayName": "Tomato", "xMin": "0.8", "yMin": "0.2", "xMax": "1.0", "yMax": "0.4"}, {"displayName": "Salad", "xMin": "0.0", "yMin": "0.0", "xMax": "1.0", "yMax": "1.0"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png", "boundingBoxAnnotations": [{"displayName": "Baked goods", "xMin": "0.5", "yMin": "0.7", "xMax": "0.8", "yMax": "0.8"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.tiff", "boundingBoxAnnotations": [{"displayName": "Salad", "xMin": "0.1", "yMin": "0.2", "xMax": "0.8", "yMax": "0.9"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...
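
The coordinate values in the example above are strings between 0.0 and 1.0, which suggests that they are fractions of the image width and height. Assuming that interpretation holds for your dataset, the sketch below converts one annotation to integer pixel coordinates. The function name to_pixel_box, the image size, and the sample annotation are hypothetical.

```python
def to_pixel_box(annotation, width, height):
    """Convert a bounding box annotation to (x_min, y_min, x_max, y_max) pixels.

    Assumes the coordinates are normalized to the range [0, 1] and stored
    as strings, as in the example records.
    """
    x_min = round(float(annotation["xMin"]) * width)
    y_min = round(float(annotation["yMin"]) * height)
    x_max = round(float(annotation["xMax"]) * width)
    y_max = round(float(annotation["yMax"]) * height)
    return (x_min, y_min, x_max, y_max)


# Hypothetical annotation and image size, for illustration only.
annotation = {"displayName": "Tomato", "xMin": "0.25", "yMin": "0.5", "xMax": "0.75", "yMax": "1.0"}
pixel_box = to_pixel_box(annotation, width=640, height=480)
```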

Tabular datasets

Vertex AI passes tabular data to your training application in CSV format or as a URI to a BigQuery table or view. For more information about the data source format and requirements, see Preparing your import source. Refer to the dataset in Google Cloud console for more information about the dataset schema.

Text datasets

Text datasets are passed to your training application in JSON Lines format. See the section for your dataset's objective to learn more about how Vertex AI formats your dataset.

Single-label classification

Vertex AI uses the following publicly accessible schema when exporting a single-label text classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_single_label_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.

{
  "classificationAnnotation": {
    "displayName": "label"
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "classificationAnnotation": {
    "displayName": "label2"
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
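
As the two records above show, a text data item carries its text either inline (textContent) or as a Cloud Storage reference (textGcsUri). A loader therefore needs to branch on which field is present. The following sketch shows that branch; the function name text_source is hypothetical, and fetching the referenced file from Cloud Storage is left out.

```python
def text_source(item):
    """Return ('inline', text) or ('gcs', uri) for one text data item."""
    if "textContent" in item:
        return ("inline", item["textContent"])
    if "textGcsUri" in item:
        return ("gcs", item["textGcsUri"])
    raise KeyError("data item has neither textContent nor textGcsUri")


# Records shaped like the format above, with its placeholder values.
inline_item = {"classificationAnnotation": {"displayName": "label"}, "textContent": "inline_text"}
remote_item = {"classificationAnnotation": {"displayName": "label2"}, "textGcsUri": "gcs_uri_to_file"}
```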

Multi-label classification

Vertex AI uses the following publicly accessible schema when exporting a multi-label text classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_multi_label_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.

{
  "classificationAnnotations": [
    {
      "displayName": "label1"
    },
    {
      "displayName": "label2"
    }
  ],
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "classificationAnnotations": [
    {
      "displayName": "label2"
    },
    {
      "displayName": "label3"
    }
  ],
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}

Entity extraction

Vertex AI uses the following publicly accessible schema when exporting an entity extraction dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.

{
  "textSegmentAnnotations": [
    {
      "startOffset": number,
      "endOffset": number,
      "displayName": "label"
    },
    ...
  ],
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "textSegmentAnnotations": [
    {
      "startOffset": number,
      "endOffset": number,
      "displayName": "label"
    },
    ...
  ],
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
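
To show how the offsets relate to the text, the sketch below slices each annotated segment out of an inline textContent value. It assumes that the offsets are zero-based character indexes. Whether endOffset is inclusive or exclusive is not stated here, so the sketch exposes it as a parameter; check the schema file before relying on either setting. The function name and the sample record are hypothetical.

```python
def extract_entities(item, end_inclusive=False):
    """Return (label, text) pairs for each segment of an inline text item.

    `end_inclusive` is an assumption switch: set it to True if endOffset
    points at the last character of the segment rather than one past it.
    """
    text = item["textContent"]
    entities = []
    for annotation in item.get("textSegmentAnnotations", []):
        start = int(annotation["startOffset"])
        end = int(annotation["endOffset"]) + (1 if end_inclusive else 0)
        entities.append((annotation["displayName"], text[start:end]))
    return entities


# Hypothetical record, for illustration only.
sample_item = {
    "textSegmentAnnotations": [
        {"startOffset": 0, "endOffset": 12, "displayName": "person"},
    ],
    "textContent": "Ada Lovelace wrote notes.",
}
entities = extract_entities(sample_item)
```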

Sentiment analysis

Vertex AI uses the following publicly accessible schema when exporting a sentiment analysis dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_text_sentiment_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.

{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
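
Each sentiment record carries both a score and the maximum score of its scale, so one possible preprocessing step is to rescale the score to the range [0, 1]. The sketch below does that, assuming sentiment is an integer between 0 and sentimentMax. The function name and the sample values are hypothetical, and your model may well consume the raw integer instead.

```python
def normalized_sentiment(item):
    """Rescale the sentiment score of one data item to the range [0, 1]."""
    annotation = item["sentimentAnnotation"]
    sentiment = int(annotation["sentiment"])
    maximum = int(annotation["sentimentMax"])
    if maximum <= 0:
        raise ValueError("sentimentMax must be positive")
    if not 0 <= sentiment <= maximum:
        raise ValueError("sentiment must lie between 0 and sentimentMax")
    return sentiment / maximum


# Hypothetical record, for illustration only.
sample_item = {
    "sentimentAnnotation": {"sentiment": 2, "sentimentMax": 4},
    "textContent": "inline_text",
}
score = normalized_sentiment(sample_item)
```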

Video datasets

Video datasets are passed to your training application in JSON Lines format. See the section for your dataset's objective to learn more about how Vertex AI formats your dataset.

Action recognition

Vertex AI uses the following publicly accessible schema when exporting an action recognition dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/video_action_recognition_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.

{
  "videoGcsUri": "gs://bucket/filename.ext",
  "timeSegments": [{
    "startTime": "start_time_of_fully_annotated_segment",
    "endTime": "end_time_of_segment"
  }],
  "timeSegmentAnnotations": [{
    "displayName": "LABEL",
    "startTime": "start_time_of_segment",
    "endTime": "end_time_of_segment"
  }],
  "dataItemResourceLabels": {
    "ml_use": "train|test"
  }
}

Note: The time segments here are used to calculate the timestamps of the actions. The startTime and endTime of timeSegmentAnnotations can be equal, in which case they correspond to the key frame of the action.

Example JSON Lines

{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"ml_use": "training"}}
{"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"ml_use": "test"}}
...
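
The time values in the example above are strings such as "1.0s". The following sketch converts them to floating-point seconds and computes the length of each annotated segment. It assumes every time value is a decimal number followed by the suffix s, as in the example; the function names are hypothetical.

```python
import json


def parse_seconds(value):
    """Convert a string such as '12.0s' to a float number of seconds."""
    if not value.endswith("s"):
        raise ValueError(f"expected a value ending in 's', got {value!r}")
    return float(value[:-1])


def segment_durations(item):
    """Return (label, duration_in_seconds) for each time segment annotation."""
    return [
        (
            annotation["displayName"],
            parse_seconds(annotation["endTime"]) - parse_seconds(annotation["startTime"]),
        )
        for annotation in item.get("timeSegmentAnnotations", [])
    ]


line = (
    '{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": '
    '[{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], '
    '"dataItemResourceLabels": {"ml_use": "training"}}'
)
durations = segment_durations(json.loads(line))
```

A duration of zero is valid under this scheme: it marks a segment whose start and end times are equal.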

Classification

Vertex AI uses the following publicly accessible schema when exporting a classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/video_classification_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.

{
  "videoGcsUri": "gs://bucket/filename.ext",
  "timeSegmentAnnotations": [{
    "displayName": "LABEL",
    "startTime": "start_time_of_segment",
    "endTime": "end_time_of_segment"
  }],
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "train|test"
  }
}

Example JSON Lines - Video classification:

{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...

Object tracking

Vertex AI uses the following publicly accessible schema when exporting an object tracking dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/object_tracking_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.

{
  "videoGcsUri": "gs://bucket/filename.ext",
  "TemporalBoundingBoxAnnotations": [{
    "displayName": "LABEL",
    "xMin": "leftmost_coordinate_of_the_bounding box",
    "xMax": "rightmost_coordinate_of_the_bounding box",
    "yMin": "topmost_coordinate_of_the_bounding box",
    "yMax": "bottommost_coordinate_of_the_bounding box",
    "timeOffset": "timeframe_object-detected",
    "instanceId": "instance_of_object",
    "annotationResourceLabels": "resource_labels"
  }],
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "train|test"
  }
}

Example JSON Lines

{"videoGcsUri": "gs://demo-data/video1.mp4", "temporal_bounding_box_annotations": [{"displayName": "horse", "instance_id": "-1", "time_offset": "4.000000s", "xMin": "0.668912", "yMin": "0.560642", "xMax": "1.000000", "yMax": "1.000000"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo-data/video2.mp4", "temporal_bounding_box_annotations": [{"displayName": "horse", "instance_id": "-1", "time_offset": "71.000000s", "xMin": "0.679056", "yMin": "0.070957", "xMax": "0.801716", "yMax": "0.290358"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-18 UTC.