Create a Vertex AI tabular dataset
- island - The island where a species of penguin is found.
- culmen_length_mm - The length of the ridge along the top of the bill of a penguin.
- culmen_depth_mm - The height of the bill of a penguin.
- flipper_length_mm - The length of the flipper-like wing of a penguin.
- body_mass_g - The mass of the body of a penguin.
- sex - The sex of the penguin.
Download, preprocess, and split the data
In this section, you download the publicly available BigQuery dataset and prepare its data. To prepare the data, you do the following:
- Convert categorical features (features described with a string instead of a number) to numeric data. For example, you convert the names of the three types of penguins to the numerical values 0, 1, and 2.
- Remove any columns in the dataset that aren't used.
- Remove any rows that cannot be used.
- Split the data into two distinct sets of data. Each set of data is stored in a pandas DataFrame object:
  - The df_train DataFrame contains data used to train your model.
  - The df_for_prediction DataFrame contains data used to generate predictions.
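The clean-and-split pattern used in this tutorial can be sketched on toy data first. This is a minimal illustration using hypothetical values, not the real BigQuery rows; the column names and NA markers mirror the tutorial, but the data is made up:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the downloaded table (hypothetical values).
# "NA" and "." mark unusable entries, as in the tutorial.
df = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie", "Gentoo"],
        "island": ["Dream", "Biscoe", "Dream", "Torgersen", "NA"],
        "sex": ["MALE", "FEMALE", ".", "FEMALE", "MALE"],
    }
)

# Drop unusable rows: replace the NA markers with NaN, then drop those rows.
df = df.replace(to_replace=["NA", "."], value=np.nan).dropna()

# pd.factorize returns integer codes plus the unique string values,
# so the original labels can always be recovered from the codes.
df["species"], species_values = pd.factorize(df["species"])
print(dict(enumerate(species_values)))

# Split 80/20; the holdout rows are exactly those not sampled into training.
df_train = df.sample(frac=0.8, random_state=100)
df_holdout = df[~df.index.isin(df_train.index)]
```

The same replace-dropna-factorize-sample sequence appears in the real notebook cell, applied to the full penguins table.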
After processing the data, the code maps the three categorical columns' numerical values to their string values, then prints them so that you can see what the data looks like.
To download and process your data, run the following code in your notebook:
```python
import numpy as np
import pandas as pd

LABEL_COLUMN = "species"

# Define the BigQuery source dataset
BQ_SOURCE = "bigquery-public-data.ml_datasets.penguins"

# Define NA values
NA_VALUES = ["NA", "."]

# Download a table
table = bq_client.get_table(BQ_SOURCE)
df = bq_client.list_rows(table).to_dataframe()

# Drop unusable rows
df = df.replace(to_replace=NA_VALUES, value=np.nan).dropna()

# Convert categorical columns to numeric
df["island"], island_values = pd.factorize(df["island"])
df["species"], species_values = pd.factorize(df["species"])
df["sex"], sex_values = pd.factorize(df["sex"])

# Split into a training and holdout dataset
df_train = df.sample(frac=0.8, random_state=100)
df_for_prediction = df[~df.index.isin(df_train.index)]

# Map numeric values to string values
index_to_island = dict(enumerate(island_values))
index_to_species = dict(enumerate(species_values))
index_to_sex = dict(enumerate(sex_values))

# View the mapped island, species, and sex data
print(index_to_island)
print(index_to_species)
print(index_to_sex)
```

The following are the printed mapped values for the characteristics that are not numeric:
```
{0: 'Dream', 1: 'Biscoe', 2: 'Torgersen'}
{0: 'Adelie Penguin (Pygoscelis adeliae)', 1: 'Chinstrap penguin (Pygoscelis antarctica)', 2: 'Gentoo penguin (Pygoscelis papua)'}
{0: 'FEMALE', 1: 'MALE'}
```

The first three values are the islands a penguin might inhabit. The second three values are important because they map to the predictions you receive at the end of this tutorial. The third row shows that the FEMALE sex characteristic maps to 0 and the MALE sex characteristic maps to 1.
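When you read predictions later in the tutorial, the model returns a class index rather than a species name, and you look that index up in index_to_species. A minimal sketch of that lookup (the mapping is reproduced here so the snippet is self-contained, and the predicted index is hypothetical, not a real model output):

```python
# The species mapping printed above.
index_to_species = {
    0: "Adelie Penguin (Pygoscelis adeliae)",
    1: "Chinstrap penguin (Pygoscelis antarctica)",
    2: "Gentoo penguin (Pygoscelis papua)",
}

predicted_class = 2  # hypothetical model output, for illustration only
species_name = index_to_species[predicted_class]
print(species_name)  # Gentoo penguin (Pygoscelis papua)
```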
Create a tabular dataset for training your model
In the previous step, you downloaded and processed your data. In this step, you load the data stored in your df_train DataFrame into a BigQuery dataset. Then, you use the BigQuery dataset to create a Vertex AI tabular dataset. This tabular dataset is used to train your model. For more information, see Use managed datasets.
Create a BigQuery dataset
To create the BigQuery dataset that's used to create a Vertex AI dataset, run the following code. The create_dataset command returns a new BigQuery Dataset.
```python
# Create a BigQuery dataset
bq_dataset_id = f"{project_id}.dataset_id_unique"
bq_dataset = bigquery.Dataset(bq_dataset_id)
bq_client.create_dataset(bq_dataset, exists_ok=True)
```

Create a Vertex AI tabular dataset
To convert your BigQuery dataset to a Vertex AI tabular dataset, run the following code. You can ignore the warning about the number of rows required to train with tabular data. Because the purpose of this tutorial is to quickly show you how to get predictions, a relatively small set of data is used. In a real-world scenario, you want at least 1,000 rows in a tabular dataset. The create_from_dataframe command returns a Vertex AI TabularDataset.
```python
# Create a Vertex AI tabular dataset
dataset = aiplatform.TabularDataset.create_from_dataframe(
    df_source=df_train,
    staging_path=f"bq://{bq_dataset_id}.table-unique",
    display_name="sample-penguins",
)
```

You now have the Vertex AI tabular dataset used to train your model.
(Optional) View the public dataset in BigQuery
If you want to view the public data used in this tutorial, you can open it in BigQuery.
1. In Search in the Google Cloud console, enter BigQuery, then press return.
2. In the search results, click BigQuery.
3. In the Explorer window, expand bigquery-public-data.
4. Under bigquery-public-data, expand ml_datasets, then click penguins.
5. Click any of the names under Field name to view that field's data.

Last updated 2025-11-24 UTC.