Create a dataset for training classification and regression models

This page shows you how to create a Vertex AI dataset from your tabular data so you can start training classification and regression models. You can create a dataset using either the Google Cloud console or the Vertex AI API.

Before you begin

Before you create a Vertex AI dataset from your tabular data, first prepare your data. For details, see:

Create an empty dataset and associate your prepared data

To create a machine learning model for classification or regression, you must first have a representative collection of data to train with. Use the Google Cloud console or the API to associate your prepared data with the dataset. Associating your data lets you make modifications and start model training.

Google Cloud console

  1. In the Google Cloud console, in the Vertex AI section, go to the Datasets page.

    Go to the Datasets page

  2. Click Create to open the create dataset details page.
  3. Modify the Dataset name field to create a descriptive dataset display name.
  4. Select the Tabular tab.
  5. Select the Regression/classification objective.
  6. Select a region from the Region drop-down list.
  7. If you want to use customer-managed encryption keys (CMEK) with your dataset, open Advanced options and provide your key. (Preview)
  8. Click Create to create your empty dataset, and advance to the Source tab.
  9. Choose one of the following options, based on your data source. Tabular training data in Cloud Storage or BigQuery is not imported into Vertex AI. (When you import from local files, they are imported into Cloud Storage.) When you create a dataset with tabular data, the data is associated with the dataset. Changes you make to your data source in Cloud Storage or BigQuery after dataset creation are incorporated into models subsequently trained with that dataset. A snapshot of the dataset is taken when model training begins.

    CSV files on your computer

    1. Click Upload CSV files from your computer.
    2. Click Select files and choose all the local files to upload to a Cloud Storage bucket.
    3. In the Select a Cloud Storage path section, enter the path to the Cloud Storage bucket or click Browse to choose a bucket location.

    CSV files in Cloud Storage

    1. Click Select CSV files from Cloud Storage.
    2. In the Select CSV files from Cloud Storage section, enter the path to the Cloud Storage bucket or click Browse to choose the location of your CSV files.

    A table or view in BigQuery

    1. Click Select a table or view from BigQuery.
    2. Enter the project, dataset, and table IDs for your input file.
  10. Click Continue.

    Your data source is associated with your dataset.

API

When you create a dataset, you also associate it with its data source. The code needed to create a dataset depends on whether the training data resides in Cloud Storage or BigQuery. If the data source resides in a different project, make sure you set up the required permissions. Tabular training data in Cloud Storage or BigQuery is not imported into Vertex AI. (When you import from local files, they are imported into Cloud Storage.) When you create a dataset with tabular data, the data is associated with the dataset. Changes you make to your data source in Cloud Storage or BigQuery after dataset creation are incorporated into models subsequently trained with that dataset. A snapshot of the dataset is taken when model training begins.

Creating a dataset with data in Cloud Storage

REST

You use the datasets.create method to create a dataset.

Before using any of the request data, make the following replacements:

  • LOCATION: Region where the dataset will be stored. This must be a region that supports dataset resources. For example, us-central1.
  • PROJECT: Your project ID.
  • DATASET_NAME: Display name for the dataset.
  • METADATA_SCHEMA_URI: The URI to the schema file for your objective: gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml
  • URI: Paths (URIs) to the Cloud Storage buckets containing the training data. There can be more than one. Each URI has the form:
    gs://GCSprojectId/bucketName/fileName
  • PROJECT_NUMBER: Your project's automatically generated project number.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/datasets

Request JSON body:

{
  "display_name": "DATASET_NAME",
  "metadata_schema_uri": "METADATA_SCHEMA_URI",
  "metadata": {
    "input_config": {
      "gcs_source": {
        "uri": [URI1, URI2, ...]
      }
    }
  }
}
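If you prefer to build request.json programmatically, a short script like the following can generate the body above. The display name and bucket paths here are placeholder values, not part of the API:

```python
import json

# Placeholder values -- replace with your own dataset name and CSV locations.
body = {
    "display_name": "my_tabular_dataset",
    "metadata_schema_uri": (
        "gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml"
    ),
    "metadata": {
        "input_config": {
            "gcs_source": {
                # "uri" accepts one or more Cloud Storage paths.
                "uri": ["gs://my-bucket/data1.csv", "gs://my-bucket/data2.csv"],
            }
        }
    },
}

# Write the body to request.json for use with the curl command below.
with open("request.json", "w") as f:
    json.dump(body, f, indent=2)
```

This keeps the URI list in one place when you associate many CSV files with the dataset.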

To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/datasets"

PowerShell

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/datasets" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/datasets/DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateDatasetOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-07-07T21:27:35.964882Z",
      "updateTime": "2020-07-07T21:27:35.964882Z"
    }
  }
}
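The returned operation name embeds the new dataset's ID, which you need for later calls. A small helper like the following (a hypothetical convenience, not part of any SDK) can extract it:

```python
def dataset_id_from_operation(name: str) -> str:
    """Extract DATASET_ID from an operation resource name of the form
    projects/P/locations/L/datasets/D/operations/O."""
    parts = name.split("/")
    # The dataset ID is the path segment immediately after "datasets".
    return parts[parts.index("datasets") + 1]
```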

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.aiplatform.v1.CreateDatasetOperationMetadata;
import com.google.cloud.aiplatform.v1.Dataset;
import com.google.cloud.aiplatform.v1.DatasetServiceClient;
import com.google.cloud.aiplatform.v1.DatasetServiceSettings;
import com.google.cloud.aiplatform.v1.LocationName;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreateDatasetTabularGcsSample {

  public static void main(String[] args)
      throws InterruptedException, ExecutionException, TimeoutException, IOException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "YOUR_PROJECT_ID";
    String datasetDisplayName = "YOUR_DATASET_DISPLAY_NAME";
    String gcsSourceUri = "gs://YOUR_GCS_SOURCE_BUCKET/path_to_your_gcs_table/file.csv";
    createDatasetTableGcs(project, datasetDisplayName, gcsSourceUri);
  }

  static void createDatasetTableGcs(String project, String datasetDisplayName, String gcsSourceUri)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    DatasetServiceSettings settings =
        DatasetServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (DatasetServiceClient datasetServiceClient = DatasetServiceClient.create(settings)) {
      String location = "us-central1";
      String metadataSchemaUri =
          "gs://google-cloud-aiplatform/schema/dataset/metadata/tables_1.0.0.yaml";
      LocationName locationName = LocationName.of(project, location);

      String jsonString =
          "{\"input_config\": {\"gcs_source\": {\"uri\": [\"" + gcsSourceUri + "\"]}}}";
      Value.Builder metaData = Value.newBuilder();
      JsonFormat.parser().merge(jsonString, metaData);

      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(datasetDisplayName)
              .setMetadataSchemaUri(metadataSchemaUri)
              .setMetadata(metaData)
              .build();

      OperationFuture<Dataset, CreateDatasetOperationMetadata> datasetFuture =
          datasetServiceClient.createDatasetAsync(locationName, dataset);
      System.out.format("Operation name: %s\n", datasetFuture.getInitialFuture().get().getName());
      System.out.println("Waiting for operation to finish...");
      Dataset datasetResponse = datasetFuture.get(300, TimeUnit.SECONDS);

      System.out.println("Create Dataset Table GCS sample");
      System.out.format("Name: %s\n", datasetResponse.getName());
      System.out.format("Display Name: %s\n", datasetResponse.getDisplayName());
      System.out.format("Metadata Schema Uri: %s\n", datasetResponse.getMetadataSchemaUri());
      System.out.format("Metadata: %s\n", datasetResponse.getMetadata());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 * (Not necessary if passing values as arguments)
 */
// const datasetDisplayName = 'YOUR_DATASET_DISPLAY_NAME';
// const gcsSourceUri = 'YOUR_GCS_SOURCE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Dataset Service Client library
const {DatasetServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const datasetServiceClient = new DatasetServiceClient(clientOptions);

async function createDatasetTabularGcs() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const metadata = {
    structValue: {
      fields: {
        inputConfig: {
          structValue: {
            fields: {
              gcsSource: {
                structValue: {
                  fields: {
                    uri: {
                      listValue: {
                        values: [{stringValue: gcsSourceUri}],
                      },
                    },
                  },
                },
              },
            },
          },
        },
      },
    },
  };
  // Configure the dataset resource
  const dataset = {
    displayName: datasetDisplayName,
    metadataSchemaUri:
      'gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml',
    metadata: metadata,
  };
  const request = {
    parent,
    dataset,
  };

  // Create dataset request
  const [response] = await datasetServiceClient.createDataset(request);
  console.log(`Long running operation: ${response.name}`);

  // Wait for operation to complete
  await response.promise();
  const result = response.result;

  console.log('Create dataset tabular gcs response');
  console.log(`\tName: ${result.name}`);
  console.log(`\tDisplay name: ${result.displayName}`);
  console.log(`\tMetadata schema uri: ${result.metadataSchemaUri}`);
  console.log(`\tMetadata: ${JSON.stringify(result.metadata)}`);
}
createDatasetTabularGcs();

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from typing import List, Union

from google.cloud import aiplatform


def create_and_import_dataset_tabular_gcs_sample(
    display_name: str,
    project: str,
    location: str,
    gcs_source: Union[str, List[str]],
):
    aiplatform.init(project=project, location=location)

    dataset = aiplatform.TabularDataset.create(
        display_name=display_name,
        gcs_source=gcs_source,
    )

    dataset.wait()

    print(f'\tDataset: "{dataset.display_name}"')
    print(f'\tname: "{dataset.resource_name}"')

Creating a dataset with data in BigQuery

REST

You use the datasets.create method to create a dataset.

Before using any of the request data, make the following replacements:

  • LOCATION: Region where the dataset will be stored. This must be a region that supports dataset resources. For example, us-central1.
  • PROJECT: Your project ID.
  • DATASET_NAME: Display name for the dataset.
  • METADATA_SCHEMA_URI: The URI to the schema file for your objective: gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml
  • URI: Path to the BigQuery table containing the training data. In the form:
    bq://bqprojectId.bqDatasetId.bqTableId
  • PROJECT_NUMBER: Your project's automatically generated project number.
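A quick way to catch a malformed BigQuery URI before sending the request is a rough format check like the sketch below. The pattern is illustrative only, not the service's exact validation rules:

```python
import re

# Rough pattern for bq://bqprojectId.bqDatasetId.bqTableId. Project IDs may
# contain hyphens; dataset and table IDs are treated as word characters here.
_BQ_URI = re.compile(r"^bq://[\w-]+\.\w+\.\w+$")


def looks_like_bq_uri(uri: str) -> bool:
    """Return True if the string resembles a bq:// table URI."""
    return _BQ_URI.fullmatch(uri) is not None
```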

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/datasets

Request JSON body:

{
  "display_name": "DATASET_NAME",
  "metadata_schema_uri": "METADATA_SCHEMA_URI",
  "metadata": {
    "input_config": {
      "bigquery_source": {
        "uri": "URI"
      }
    }
  }
}

To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/datasets"

PowerShell

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/datasets" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/datasets/DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateDatasetOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-07-07T21:27:35.964882Z",
      "updateTime": "2020-07-07T21:27:35.964882Z"
    }
  }
}

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.aiplatform.v1.CreateDatasetOperationMetadata;
import com.google.cloud.aiplatform.v1.Dataset;
import com.google.cloud.aiplatform.v1.DatasetServiceClient;
import com.google.cloud.aiplatform.v1.DatasetServiceSettings;
import com.google.cloud.aiplatform.v1.LocationName;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreateDatasetTabularBigquerySample {

  public static void main(String[] args)
      throws InterruptedException, ExecutionException, TimeoutException, IOException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "YOUR_PROJECT_ID";
    String bigqueryDisplayName = "YOUR_DATASET_DISPLAY_NAME";
    String bigqueryUri =
        "bq://YOUR_GOOGLE_CLOUD_PROJECT_ID.BIGQUERY_DATASET_ID.BIGQUERY_TABLE_OR_VIEW_ID";
    createDatasetTableBigquery(project, bigqueryDisplayName, bigqueryUri);
  }

  static void createDatasetTableBigquery(
      String project, String bigqueryDisplayName, String bigqueryUri)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    DatasetServiceSettings settings =
        DatasetServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (DatasetServiceClient datasetServiceClient = DatasetServiceClient.create(settings)) {
      String location = "us-central1";
      String metadataSchemaUri =
          "gs://google-cloud-aiplatform/schema/dataset/metadata/tables_1.0.0.yaml";
      LocationName locationName = LocationName.of(project, location);

      String jsonString =
          "{\"input_config\": {\"bigquery_source\": {\"uri\": \"" + bigqueryUri + "\"}}}";
      Value.Builder metaData = Value.newBuilder();
      JsonFormat.parser().merge(jsonString, metaData);

      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(bigqueryDisplayName)
              .setMetadataSchemaUri(metadataSchemaUri)
              .setMetadata(metaData)
              .build();

      OperationFuture<Dataset, CreateDatasetOperationMetadata> datasetFuture =
          datasetServiceClient.createDatasetAsync(locationName, dataset);
      System.out.format("Operation name: %s\n", datasetFuture.getInitialFuture().get().getName());
      System.out.println("Waiting for operation to finish...");
      Dataset datasetResponse = datasetFuture.get(300, TimeUnit.SECONDS);

      System.out.println("Create Dataset Table Bigquery sample");
      System.out.format("Name: %s\n", datasetResponse.getName());
      System.out.format("Display Name: %s\n", datasetResponse.getDisplayName());
      System.out.format("Metadata Schema Uri: %s\n", datasetResponse.getMetadataSchemaUri());
      System.out.format("Metadata: %s\n", datasetResponse.getMetadata());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 * (Not necessary if passing values as arguments)
 */
// const datasetDisplayName = 'YOUR_DATASET_DISPLAY_NAME';
// const bigquerySourceUri = 'YOUR_BIGQUERY_SOURCE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Dataset Service Client library
const {DatasetServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const datasetServiceClient = new DatasetServiceClient(clientOptions);

async function createDatasetTabularBigquery() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const metadata = {
    structValue: {
      fields: {
        inputConfig: {
          structValue: {
            fields: {
              bigquerySource: {
                structValue: {
                  fields: {
                    uri: {
                      listValue: {
                        values: [{stringValue: bigquerySourceUri}],
                      },
                    },
                  },
                },
              },
            },
          },
        },
      },
    },
  };
  // Configure the dataset resource
  const dataset = {
    displayName: datasetDisplayName,
    metadataSchemaUri:
      'gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml',
    metadata: metadata,
  };
  const request = {
    parent,
    dataset,
  };

  // Create dataset request
  const [response] = await datasetServiceClient.createDataset(request);
  console.log(`Long running operation: ${response.name}`);

  // Wait for operation to complete
  await response.promise();
  const result = response.result;

  console.log('Create dataset tabular bigquery response');
  console.log(`\tName: ${result.name}`);
  console.log(`\tDisplay name: ${result.displayName}`);
  console.log(`\tMetadata schema uri: ${result.metadataSchemaUri}`);
  console.log(`\tMetadata: ${JSON.stringify(result.metadata)}`);
}
createDatasetTabularBigquery();

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from google.cloud import aiplatform


def create_and_import_dataset_tabular_bigquery_sample(
    display_name: str,
    project: str,
    location: str,
    bq_source: str,
):
    aiplatform.init(project=project, location=location)

    dataset = aiplatform.TabularDataset.create(
        display_name=display_name,
        bq_source=bq_source,
    )

    dataset.wait()

    print(f'\tDataset: "{dataset.display_name}"')
    print(f'\tname: "{dataset.resource_name}"')

Get operation status

Some requests start long-running operations that require time to complete. These requests return an operation name, which you can use to view the operation's status or cancel the operation. Vertex AI provides helper methods to make calls against long-running operations. For more information, see Working with long-running operations.
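The helper methods handle polling for you, but the underlying pattern is simple. The sketch below assumes a caller-supplied get_status function that returns the operation as a dict, shaped like the REST responses shown earlier (a "done" key appears once the operation finishes):

```python
import time


def wait_for_operation(get_status, timeout_s=300, interval_s=5):
    """Poll get_status() until the operation reports done or the timeout expires.

    get_status is any zero-argument callable returning an operation dict,
    for example one that issues a GET against
    https://LOCATION-aiplatform.googleapis.com/v1/OPERATION_NAME.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        op = get_status()
        if op.get("done"):
            return op
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation not finished after {timeout_s}s")
        time.sleep(interval_s)
```

This mirrors what the client libraries' long-running-operation futures do internally, with a fixed rather than exponential polling interval for clarity.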

What's next

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-11-24 UTC.