Use BigQuery ML to predict penguin weight

In this tutorial, you use a linear regression model in BigQuery ML to predict the weight of a penguin based on the penguin's demographic information. A linear regression is a type of regression model that generates a continuous value from a linear combination of input features.

This tutorial uses the bigquery-public-data.ml_datasets.penguins dataset.

Objectives

In this tutorial, you will perform the following tasks:

  • Create a linear regression model.
  • Evaluate the model.
  • Make predictions by using the model.
Note: This tutorial covers linear regression in pure SQL. For a linear regression tutorial using Python and BigQuery DataFrames on the same dataset, see Create a regression model with BigQuery DataFrames.

Costs

This tutorial uses billable components of Google Cloud, including the following:

  • BigQuery
  • BigQuery ML

For more information on BigQuery costs, see the BigQuery pricing page.

For more information on BigQuery ML costs, see BigQuery ML pricing.

Before you begin

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  2. Verify that billing is enabled for your Google Cloud project.

  3. Enable the BigQuery API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API

Required permissions

To create the model using BigQuery ML, you need the following IAM permissions:

  • bigquery.jobs.create
  • bigquery.models.create
  • bigquery.models.getData
  • bigquery.models.updateData
  • bigquery.models.updateMetadata

To run inference, you need the following permissions:

  • bigquery.models.getData on the model
  • bigquery.jobs.create

Create a dataset

Create a BigQuery dataset to store your ML model.

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter bqml_tutorial.

    • For Location type, select Multi-region, and then select US (multiple regions in United States).

    • Leave the remaining default settings as they are, and click Create dataset.

bq

To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.

  1. Create a dataset named bqml_tutorial with the data location set to US and a description of BigQuery ML tutorial dataset:

    bq --location=US mk -d \
        --description "BigQuery ML tutorial dataset." \
        bqml_tutorial

    Instead of using the --dataset flag, the command uses the -d shortcut. If you omit -d and --dataset, the command defaults to creating a dataset.

  2. Confirm that the dataset was created:

    bq ls

API

Call the datasets.insert method with a defined dataset resource.

{"datasetReference": {"datasetId": "bqml_tutorial"}}

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

import google.cloud.bigquery

bqclient = google.cloud.bigquery.Client()
bqclient.create_dataset("bqml_tutorial", exists_ok=True)

Create the model

Create a linear regression model using the penguins sample dataset for BigQuery.

SQL

You can create a linear regression model by using the CREATE MODEL statement and specifying LINEAR_REG for the model type. Creating the model includes training the model.

The following are useful things to know about the CREATE MODEL statement:

  • The input_label_cols option specifies which column in the SELECT statement to use as the label column. Here, the label column is body_mass_g. For linear regression models, the label column must be real-valued, that is, the column values must be real numbers.
  • This query's SELECT statement uses the following columns in the bigquery-public-data.ml_datasets.penguins table to predict a penguin's weight:

    • species: the species of penguin.
    • island: the island that the penguin resides on.
    • culmen_length_mm: the length of the penguin's culmen in millimeters.
    • culmen_depth_mm: the depth of the penguin's culmen in millimeters.
    • flipper_length_mm: the length of the penguin's flippers in millimeters.
    • sex: the sex of the penguin.
  • The WHERE clause in this query's SELECT statement, WHERE body_mass_g IS NOT NULL, excludes rows where the body_mass_g column is NULL.

Run the query that creates your linear regression model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following query:

    CREATE OR REPLACE MODEL `bqml_tutorial.penguins_model`
      OPTIONS (
        model_type = 'linear_reg',
        input_label_cols = ['body_mass_g']) AS
    SELECT
      *
    FROM
      `bigquery-public-data.ml_datasets.penguins`
    WHERE
      body_mass_g IS NOT NULL;

  3. It takes about 30 seconds to create the penguins_model model.

    To see the model, follow these steps:

    1. In the left pane, click Explorer:

      Highlighted button for the Explorer pane.

      If you don't see the left pane, click Expand left pane to open the pane.

    2. In the Explorer pane, expand your project and click Datasets.

    3. Click the bqml_tutorial dataset.

    4. Click the Models tab.

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

from bigframes.ml.linear_model import LinearRegression
import bigframes.pandas as bpd

# Load data from BigQuery
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Drop rows with nulls to get training data
training_data = bq_df.dropna(subset=["body_mass_g"])

# Specify your feature (or input) columns and the label (or output) column:
feature_columns = training_data.drop(columns=["body_mass_g"])
label_columns = training_data[["body_mass_g"]]

# Create the linear model
model = LinearRegression()
model.fit(feature_columns, label_columns)

model.to_gbq(
    your_model_id,  # For example: "bqml_tutorial.penguins_model"
    replace=True,
)

It takes about 30 seconds to create the model. To see the model, follow these steps:

  1. In the left pane, click Explorer:

    Highlighted button for the Explorer pane.

  2. In the Explorer pane, expand your project and click Datasets.

  3. Click the bqml_tutorial dataset.

  4. Click the Models tab.

Get training statistics

To see the results of the model training, you can use the ML.TRAINING_INFO function, or you can view the statistics in the Google Cloud console. In this tutorial, you use the Google Cloud console.

A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.

Loss is the penalty for a bad prediction. It is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
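For a linear regression, the loss that BigQuery ML reports is the mean squared error (MSE). The following sketch shows how MSE is computed from a handful of invented body-mass values (the numbers are made up for illustration and are not from the penguins model):

```python
# Toy illustration of mean squared error (MSE), the loss metric used for
# linear regression. All numbers here are invented for illustration.
actual = [4200.0, 3500.0, 5000.0]      # true body mass in grams
predicted = [4100.0, 3600.0, 4800.0]   # model predictions

# MSE is the average of the squared differences between prediction and truth.
mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
print(mse)  # 20000.0
```

A perfect model would have every predicted value equal to the actual value, giving an MSE of zero; larger errors are penalized quadratically.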

See the model training statistics that were generated when you ran the CREATE MODEL query:

  1. In the left pane, click Explorer:

    Highlighted button for the Explorer pane.

  2. In the Explorer pane, expand your project and click Datasets.

  3. Click the bqml_tutorial dataset.

  4. Click the Models tab.

  5. To open the model information pane, click penguins_model.

  6. Click the Training tab, and then click Table. The results should look similar to the following:

    ML.TRAINING_INFO output

    The Training Data Loss column represents the loss metric calculated after the model is trained on the training dataset. Since you performed a linear regression, this column shows the mean squared error value. A normal_equation optimization strategy is automatically used for this training, so only one iteration is required to converge to the final model. For more information on setting the model optimization strategy, see optimize_strategy.

Evaluate the model

After creating the model, evaluate the model's performance by using the ML.EVALUATE function or the score BigQuery DataFrames function to evaluate the predicted values generated by the model against the actual data.

SQL

For input, the ML.EVALUATE function takes the trained model and a dataset that matches the schema of the data that you used to train the model. In a production environment, you should evaluate the model on different data than the data you used to train the model. If you run ML.EVALUATE without providing input data, the function retrieves the evaluation metrics calculated during training. These metrics are calculated by using the automatically reserved evaluation dataset:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.penguins_model`);

Run the ML.EVALUATE query:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following query:

    SELECT
      *
    FROM
      ML.EVALUATE(MODEL `bqml_tutorial.penguins_model`,
        (
          SELECT
            *
          FROM
            `bigquery-public-data.ml_datasets.penguins`
          WHERE
            body_mass_g IS NOT NULL));

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

import bigframes.pandas as bpd

# Select the model you will be evaluating. `read_gbq_model` loads model data from
# BigQuery, but you could also use the `model` object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.penguins_model"
)

# Score the model with input data defined in an earlier step to compare
# model predictions on feature_columns to true labels in label_columns.
score = model.score(feature_columns, label_columns)

# Expected output results:
# index  mean_absolute_error  mean_squared_error  mean_squared_log_error  median_absolute_error  r2_score  explained_variance
#   0        227.012237         81838.159892            0.00507                173.080816        0.872377      0.872377
# 1 rows x 6 columns

The results should look similar to the following:

ML.EVALUATE output

Because you performed a linear regression, the results include the following columns:

  • mean_absolute_error
  • mean_squared_error
  • mean_squared_log_error
  • median_absolute_error
  • r2_score
  • explained_variance

An important metric in the evaluation results is the R2 score. The R2 score is a statistical measure that determines if the linear regression predictions approximate the actual data. A value of 0 indicates that the model explains none of the variability of the response data around the mean. A value of 1 indicates that the model explains all the variability of the response data around the mean.
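The R2 score compares the model's residual error to the total variance of the labels. As a small sketch, here it is computed by hand on invented numbers (not from the penguins model):

```python
# Toy illustration of the R2 score using made-up numbers.
actual = [4200.0, 3500.0, 5000.0]
predicted = [4100.0, 3600.0, 4800.0]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: error that remains after the model's predictions.
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Total sum of squares: variability of the labels around their mean.
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r2 = 1 - ss_res / ss_tot
print(r2)  # ~0.9467: the model explains most of the variability
```

If the predictions were simply the label mean for every row, ss_res would equal ss_tot and the score would be 0; perfect predictions give 0.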

You can also look at the model's information pane in the Google Cloud console to view the evaluation metrics:

ML.EVALUATE output

Use the model to predict outcomes

Now that you have evaluated your model, the next step is to use it to predict an outcome. You can run the ML.PREDICT function or the predict BigQuery DataFrames function on the model to predict the body mass in grams of all penguins that reside on the Biscoe Islands.

SQL

For input, the ML.PREDICT function takes the trained model and a dataset that matches the schema of the data that you used to train the model, excluding the label column.

Run the ML.PREDICT query:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following query:

    SELECT
      *
    FROM
      ML.PREDICT(MODEL `bqml_tutorial.penguins_model`,
        (
          SELECT
            *
          FROM
            `bigquery-public-data.ml_datasets.penguins`
          WHERE
            island = 'Biscoe'));

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

# Select the model you'll use for predictions. `read_gbq_model` loads
# model data from BigQuery, but you could also use the `model` object
# from previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.penguins_model",
)

# Load data from BigQuery
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Use 'contains' function to filter by island containing the string
# "Biscoe".
biscoe_data = bq_df.loc[bq_df["island"].str.contains("Biscoe")]

result = model.predict(biscoe_data)

# Expected output results:
#      predicted_body_mass_g                            species  island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g   sex
# 23             4681.782896  Gentoo penguin (Pygoscelis papua)  Biscoe              <NA>             <NA>               <NA>         <NA>  <NA>
# 332            4740.790700  Gentoo penguin (Pygoscelis papua)  Biscoe              46.2             14.4              214.0       4650.0  <NA>
# 160            4731.310452  Gentoo penguin (Pygoscelis papua)  Biscoe              44.5             14.3              216.0       4100.0  <NA>

The results should look similar to the following:

ML.PREDICT output

Explain the prediction results

SQL

To understand why the model is generating these prediction results, you can use the ML.EXPLAIN_PREDICT function.

ML.EXPLAIN_PREDICT is an extended version of the ML.PREDICT function. ML.EXPLAIN_PREDICT not only outputs prediction results, but also outputs additional columns to explain the prediction results. In practice, you can run ML.EXPLAIN_PREDICT instead of ML.PREDICT. For more information, see BigQuery ML explainable AI overview.

Run the ML.EXPLAIN_PREDICT query:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following query:

    SELECT
      *
    FROM
      ML.EXPLAIN_PREDICT(MODEL `bqml_tutorial.penguins_model`,
        (
          SELECT
            *
          FROM
            `bigquery-public-data.ml_datasets.penguins`
          WHERE
            island = 'Biscoe'),
        STRUCT(3 AS top_k_features));

  3. The results should look similar to the following:

    ML.EXPLAIN_PREDICT output

Note: The ML.EXPLAIN_PREDICT query outputs all the input feature columns, similar to what ML.PREDICT does. For readability purposes, only one feature column, species, is shown in the preceding figure.

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

# Use 'predict_explain' function to understand why the model is generating
# these prediction results. 'predict_explain' is an extended version of the
# 'predict' function that not only outputs prediction results, but also
# outputs additional columns to explain the prediction results.

# Using the trained model and utilizing data specific to Biscoe Island,
# explain the predictions of the top 3 features
explained = model.predict_explain(biscoe_data, top_k_features=3)

# Expected results:
#    predicted_body_mass_g                           top_feature_attributions  baseline_prediction_value  prediction_value  approximation_error                              species  island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
# 0            5413.510134  [{'feature': 'island', 'attribution': 7348.877...               -5320.222128       5413.510134                  0.0    Gentoo penguin (Pygoscelis papua)  Biscoe              45.2             16.4              223.0       5950.0    MALE
# 1            4768.351092  [{'feature': 'island', 'attribution': 7348.877...               -5320.222128       4768.351092                  0.0    Gentoo penguin (Pygoscelis papua)  Biscoe              46.5             14.5              213.0       4400.0  FEMALE
# 2            3235.896372  [{'feature': 'island', 'attribution': 7348.877...               -5320.222128       3235.896372                  0.0  Adelie Penguin (Pygoscelis adeliae)  Biscoe              37.7             16.0              183.0       3075.0  FEMALE
# 3            5349.603734  [{'feature': 'island', 'attribution': 7348.877...               -5320.222128       5349.603734                  0.0    Gentoo penguin (Pygoscelis papua)  Biscoe              46.4             15.6              221.0       5000.0    MALE
# 4            4637.165037  [{'feature': 'island', 'attribution': 7348.877...               -5320.222128       4637.165037                  0.0    Gentoo penguin (Pygoscelis papua)  Biscoe              46.1             13.2              211.0       4500.0  FEMALE

For linear regression models, Shapley values are used to generate feature attribution values for each feature in the model. The output includes the top three feature attributions per row of the penguins table because top_k_features was set to 3. These attributions are sorted by the absolute value of the attribution in descending order. In all examples, the feature island contributed the most to the overall prediction.
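The explanation output fits together arithmetically: each row's prediction_value equals the baseline_prediction_value plus the sum of all feature attributions (up to approximation_error), and top_k_features keeps only the attributions with the largest absolute value. The following sketch illustrates this relationship with invented attribution numbers, not the model's actual output:

```python
# Illustrative sketch (made-up numbers) of how ML.EXPLAIN_PREDICT output
# fits together: the prediction equals the baseline prediction value plus
# the sum of all feature attributions.
baseline_prediction_value = -5320.22
attributions = {
    "island": 7348.88,
    "sex": 1339.00,
    "flipper_length_mm": 193.61,
    "culmen_depth_mm": -48.76,
}

prediction_value = baseline_prediction_value + sum(attributions.values())
# -5320.22 + 7348.88 + 1339.00 + 193.61 - 48.76 = 3512.51

# top_k_features=3 keeps the three attributions with the largest
# absolute value, sorted in descending order.
top_3 = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)[:3]
print([feature for feature, _ in top_3])  # ['island', 'sex', 'flipper_length_mm']
```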

Globally explain the model

SQL

To know which features are generally the most important to determine penguin weight, you can use the ML.GLOBAL_EXPLAIN function. In order to use ML.GLOBAL_EXPLAIN, you must retrain the model with the ENABLE_GLOBAL_EXPLAIN option set to TRUE.

Retrain and get global explanations for the model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following query to retrain the model:

    #standardSQL
    CREATE OR REPLACE MODEL `bqml_tutorial.penguins_model`
      OPTIONS (
        model_type = 'linear_reg',
        input_label_cols = ['body_mass_g'],
        enable_global_explain = TRUE) AS
    SELECT
      *
    FROM
      `bigquery-public-data.ml_datasets.penguins`
    WHERE
      body_mass_g IS NOT NULL;

  3. In the query editor, run the following query to get global explanations:

    SELECT
      *
    FROM
      ML.GLOBAL_EXPLAIN(MODEL `bqml_tutorial.penguins_model`)

  4. The results should look similar to the following:

    ML.GLOBAL_EXPLAIN output

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

# To use the `global_explain()` function, the model must be recreated with
# `enable_global_explain` set to `True`.
model = LinearRegression(enable_global_explain=True)

# The model must then be fitted before it can be saved to BigQuery and then explained.
training_data = bq_df.dropna(subset=["body_mass_g"])
X = training_data.drop(columns=["body_mass_g"])
y = training_data[["body_mass_g"]]
model.fit(X, y)
model.to_gbq("bqml_tutorial.penguins_model", replace=True)

# Explain the model
explain_model = model.global_explain()

# Expected results:
#                    attribution
# feature
# island             5737.315921
# species            4073.280549
# sex                 622.070896
# flipper_length_mm   193.612051
# culmen_depth_mm     117.084944
# culmen_length_mm     94.366793

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the navigation, click the bqml_tutorial dataset you created.

  3. Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.

  4. In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Delete your project

To delete the project:

    Caution: Deleting a project has the following effects:
    • Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
    • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

    If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Google Cloud console, go to theManage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then clickDelete.
  3. In the dialog, type the project ID, and then clickShut down to delete the project.

What's next

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-18 UTC.