Create a Vertex AI tabular dataset

The model you create later in this tutorial requires a dataset to train it. The data that this tutorial uses is a publicly available dataset that contains details about three species of penguins. The following data are used to predict which of the three species a penguin is.

  • island - The island where a species of penguin is found.
  • culmen_length_mm - The length of the ridge along the top of the bill of a penguin.
  • culmen_depth_mm - The height of the bill of a penguin.
  • flipper_length_mm - The length of the flipper-like wing of a penguin.
  • body_mass_g - The mass of the body of a penguin.
  • sex - The sex of the penguin.

Download, preprocess, and split the data

In this section, you download the publicly available BigQuery dataset and prepare its data. To prepare the data, you do the following:

  • Convert categorical features (features described with a string instead of a number) to numeric data. For example, you convert the names of the three types of penguins to the numerical values 0, 1, and 2.

  • Remove any columns in the dataset that aren't used.

  • Remove any rows that cannot be used.

  • Split the data into two distinct sets of data. Each set of data is stored in a pandas DataFrame object.

    • The df_train DataFrame contains data used to train your model.

    • The df_for_prediction DataFrame contains data used to generate predictions.
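The categorical-to-numeric conversion in the steps above relies on pandas' factorize function, which assigns each distinct string an integer code in order of first appearance. A minimal sketch on a small, made-up series of species labels:

```python
import pandas as pd

# A small, made-up sample of categorical species labels
species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap"])

# factorize returns integer codes (one per row) and the unique
# values, numbered in order of first appearance
codes, uniques = pd.factorize(species)

print(codes)          # the numeric codes for each row
print(list(uniques))  # the original string labels, in code order
```

Repeated labels receive the same code, which is why the mapping back to strings later in this tutorial is a simple dictionary lookup.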

After processing the data, the code maps the three categorical columns' numerical values to their string values, then prints them so that you can see what the data looks like.

To download and process your data, run the following code in your notebook:

import numpy as np
import pandas as pd

LABEL_COLUMN = "species"

# Define the BigQuery source dataset
BQ_SOURCE = "bigquery-public-data.ml_datasets.penguins"

# Define NA values
NA_VALUES = ["NA", "."]

# Download a table
table = bq_client.get_table(BQ_SOURCE)
df = bq_client.list_rows(table).to_dataframe()

# Drop unusable rows
df = df.replace(to_replace=NA_VALUES, value=np.NaN).dropna()

# Convert categorical columns to numeric
df["island"], island_values = pd.factorize(df["island"])
df["species"], species_values = pd.factorize(df["species"])
df["sex"], sex_values = pd.factorize(df["sex"])

# Split into a training and holdout dataset
df_train = df.sample(frac=0.8, random_state=100)
df_for_prediction = df[~df.index.isin(df_train.index)]

# Map numeric values to string values
index_to_island = dict(enumerate(island_values))
index_to_species = dict(enumerate(species_values))
index_to_sex = dict(enumerate(sex_values))

# View the mapped island, species, and sex data
print(index_to_island)
print(index_to_species)
print(index_to_sex)
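As a quick sanity check (a sketch, not a required step of the tutorial), you can verify that the sample-based split used above produces two disjoint sets that together cover every row. The toy DataFrame below stands in for the processed penguin data:

```python
import pandas as pd

# A toy DataFrame standing in for the processed penguin data
df = pd.DataFrame({"value": range(10)})

# Same split logic as the tutorial: an 80% sample, plus the remaining rows
df_train = df.sample(frac=0.8, random_state=100)
df_for_prediction = df[~df.index.isin(df_train.index)]

# The two sets are disjoint and together account for every row
assert len(df_train) + len(df_for_prediction) == len(df)
assert df_train.index.intersection(df_for_prediction.index).empty
```

Because df.sample selects rows without replacement, filtering on the sampled index is enough to recover the exact complement.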

The following are the printed mapped values for characteristics that are not numeric:

{0: 'Dream', 1: 'Biscoe', 2: 'Torgersen'}
{0: 'Adelie Penguin (Pygoscelis adeliae)', 1: 'Chinstrap penguin (Pygoscelis antarctica)', 2: 'Gentoo penguin (Pygoscelis papua)'}
{0: 'FEMALE', 1: 'MALE'}

The first three values are the islands a penguin might inhabit. The second three values are important because they map to the predictions you receive at the end of this tutorial. The third row shows that the FEMALE sex characteristic maps to 0 and the MALE sex characteristic maps to 1.
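Since the model predicts a numeric class index, the index_to_species dictionary printed above is what translates a prediction back into a species name. A minimal sketch, using the dictionary values shown above and a hypothetical predicted index for illustration:

```python
# The species mapping printed in the previous step
index_to_species = {
    0: "Adelie Penguin (Pygoscelis adeliae)",
    1: "Chinstrap penguin (Pygoscelis antarctica)",
    2: "Gentoo penguin (Pygoscelis papua)",
}

# A hypothetical class index returned by the model
predicted_index = 2

# Translate the numeric prediction back to a readable species name
predicted_species = index_to_species[predicted_index]
print(predicted_species)
```

Keeping this dictionary around after preprocessing is what makes the raw numeric predictions interpretable later in the tutorial.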

Create a tabular dataset for training your model

In the previous step, you downloaded and processed your data. In this step, you load the data stored in your df_train DataFrame into a BigQuery dataset. Then, you use the BigQuery dataset to create a Vertex AI tabular dataset. This tabular dataset is used to train your model. For more information, see Use managed datasets.

Create a BigQuery dataset

To create the BigQuery dataset that's used to create a Vertex AI dataset, run the following code. The create_dataset command returns a new BigQuery Dataset.

# Create a BigQuery dataset
bq_dataset_id = f"{project_id}.dataset_id_unique"
bq_dataset = bigquery.Dataset(bq_dataset_id)
bq_client.create_dataset(bq_dataset, exists_ok=True)

Create a Vertex AI tabular dataset

To convert your BigQuery dataset to a Vertex AI tabular dataset, run the following code. You can ignore the warning about the required number of rows to train using tabular data. Because the purpose of this tutorial is to quickly show you how to get predictions, a relatively small set of data is used. In a real-world scenario, you want at least 1,000 rows in a tabular dataset. The create_from_dataframe command returns a Vertex AI TabularDataset.

# Create a Vertex AI tabular dataset
dataset = aiplatform.TabularDataset.create_from_dataframe(
    df_source=df_train,
    staging_path=f"bq://{bq_dataset_id}.table-unique",
    display_name="sample-penguins",
)

You now have the Vertex AI tabular dataset used to train your model.

(Optional) View the public dataset in BigQuery

If you want to view the public data used in this tutorial, you can open it in BigQuery.

  1. In Search in the Google Cloud console, enter BigQuery, then press return.

  2. In the search results, click BigQuery.

  3. In the Explorer window, expand bigquery-public-data.

  4. Under bigquery-public-data, expand ml_datasets, then click penguins.

  5. Click any of the names under Field name to view that field's data.



Last updated 2025-11-24 UTC.