Create a Vertex AI tabular dataset

The model you create later in this tutorial requires a dataset to train it. The data that this tutorial uses is a publicly available dataset that contains details about three species of penguins. The following data are used to predict which of the three species a penguin is.

  • island - The island where a species of penguin is found.
  • culmen_length_mm - The length of the ridge along the top of the bill of a penguin.
  • culmen_depth_mm - The height of the bill of a penguin.
  • flipper_length_mm - The length of the flipper-like wing of a penguin.
  • body_mass_g - The mass of the body of a penguin.
  • sex - The sex of the penguin.

Download, preprocess, and split the data

In this section, you download the publicly available BigQuery dataset and prepare its data. To prepare the data, you do the following:

  • Convert categorical features (features described with a string instead of a number) to numeric data. For example, you convert the names of the three types of penguins to the numerical values 0, 1, and 2.

  • Remove any columns in the dataset that aren't used.

  • Remove any rows that cannot be used.

  • Split the data into two distinct sets of data. Each set of data is stored in a pandas DataFrame object.

    • The df_train DataFrame contains data used to train your model.

    • The df_for_prediction DataFrame contains data used to generate predictions.
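The categorical-to-numeric conversion in the steps above relies on pandas' factorize function, which assigns each distinct string an integer code in order of first appearance. A minimal sketch on a small, made-up series of species labels:

```python
import pandas as pd

# A small, made-up sample of categorical species labels
species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap"])

# factorize returns integer codes (one per row) and the unique
# values, numbered in order of first appearance
codes, uniques = pd.factorize(species)

print(codes)          # the numeric codes for each row
print(list(uniques))  # the original string labels, in code order
```

Repeated labels receive the same code, which is why the mapping back to strings later in this tutorial is a simple dictionary lookup.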

After processing the data, the code maps the three categorical columns' numerical values to their string values, then prints them so that you can see what the data looks like.

To download and process your data, run the following code in your notebook:

import numpy as np
import pandas as pd

LABEL_COLUMN = "species"

# Define the BigQuery source dataset
BQ_SOURCE = "bigquery-public-data.ml_datasets.penguins"

# Define NA values
NA_VALUES = ["NA", "."]

# Download a table
table = bq_client.get_table(BQ_SOURCE)
df = bq_client.list_rows(table).to_dataframe()

# Drop unusable rows
df = df.replace(to_replace=NA_VALUES, value=np.NaN).dropna()

# Convert categorical columns to numeric
df["island"], island_values = pd.factorize(df["island"])
df["species"], species_values = pd.factorize(df["species"])
df["sex"], sex_values = pd.factorize(df["sex"])

# Split into a training and holdout dataset
df_train = df.sample(frac=0.8, random_state=100)
df_for_prediction = df[~df.index.isin(df_train.index)]

# Map numeric values to string values
index_to_island = dict(enumerate(island_values))
index_to_species = dict(enumerate(species_values))
index_to_sex = dict(enumerate(sex_values))

# View the mapped island, species, and sex data
print(index_to_island)
print(index_to_species)
print(index_to_sex)
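As a quick sanity check (a sketch, not a required step of the tutorial), you can verify that the sample-based split used above produces two disjoint sets that together cover every row. The toy DataFrame below stands in for the processed penguin data:

```python
import pandas as pd

# A toy DataFrame standing in for the processed penguin data
df = pd.DataFrame({"value": range(10)})

# Same split logic as the tutorial: an 80% sample, plus the remaining rows
df_train = df.sample(frac=0.8, random_state=100)
df_for_prediction = df[~df.index.isin(df_train.index)]

# The two sets are disjoint and together account for every row
assert len(df_train) + len(df_for_prediction) == len(df)
assert df_train.index.intersection(df_for_prediction.index).empty
```

Because df.sample selects rows without replacement, filtering on the sampled index is enough to recover the exact complement.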

The following are the printed mapped values for characteristics that are not numeric:

{0: 'Dream', 1: 'Biscoe', 2: 'Torgersen'}
{0: 'Adelie Penguin (Pygoscelis adeliae)', 1: 'Chinstrap penguin (Pygoscelis antarctica)', 2: 'Gentoo penguin (Pygoscelis papua)'}
{0: 'FEMALE', 1: 'MALE'}

The first three values are the islands a penguin might inhabit. The second three values are important because they map to the predictions you receive at the end of this tutorial. The third row shows that the FEMALE sex characteristic maps to 0 and the MALE sex characteristic maps to 1.
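Since the model predicts a numeric class index, the index_to_species dictionary printed above is what translates a prediction back into a species name. A minimal sketch, using the dictionary values shown above and a hypothetical predicted index for illustration:

```python
# The species mapping printed in the previous step
index_to_species = {
    0: "Adelie Penguin (Pygoscelis adeliae)",
    1: "Chinstrap penguin (Pygoscelis antarctica)",
    2: "Gentoo penguin (Pygoscelis papua)",
}

# A hypothetical class index returned by the model
predicted_index = 2

# Translate the numeric prediction back to a readable species name
predicted_species = index_to_species[predicted_index]
print(predicted_species)
```

Keeping this dictionary around after preprocessing is what makes the raw numeric predictions interpretable later in the tutorial.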

Create a tabular dataset for training your model

In the previous step, you downloaded and processed your data. In this step, you load the data stored in your df_train DataFrame into a BigQuery dataset. Then, you use the BigQuery dataset to create a Vertex AI tabular dataset. This tabular dataset is used to train your model. For more information, see Use managed datasets.

Create a BigQuery dataset

To create the BigQuery dataset that's used to create a Vertex AI dataset, run the following code. The create_dataset command returns a new BigQuery Dataset.

# Create a BigQuery dataset
bq_dataset_id = f"{project_id}.dataset_id_unique"
bq_dataset = bigquery.Dataset(bq_dataset_id)
bq_client.create_dataset(bq_dataset, exists_ok=True)

Create a Vertex AI tabular dataset

To convert your BigQuery dataset to a Vertex AI tabular dataset, run the following code. You can ignore the warning about the required number of rows to train using tabular data. Because the purpose of this tutorial is to quickly show you how to get predictions, a relatively small set of data is used. In a real-world scenario, you want at least 1,000 rows in a tabular dataset. The create_from_dataframe command returns a Vertex AI TabularDataset.

# Create a Vertex AI tabular dataset
dataset = aiplatform.TabularDataset.create_from_dataframe(
    df_source=df_train,
    staging_path=f"bq://{bq_dataset_id}.table-unique",
    display_name="sample-penguins",
)

You now have the Vertex AI tabular dataset used to train your model.

(Optional) View the public dataset in BigQuery

If you want to view the public data used in this tutorial, you can open it in BigQuery.

  1. In Search in the Google Cloud console, enter BigQuery, then press return.

  2. In the search results, click BigQuery.

  3. In the Explorer window, expand bigquery-public-data.

  4. Under bigquery-public-data, expand ml_datasets, then click penguins.

  5. Click any of the names under Field name to view that field's data.



Last updated 2025-11-24 UTC.