Use BigQuery ML to predict penguin weight
In this tutorial, you use a linear regression model in BigQuery ML to predict the weight of a penguin based on the penguin's demographic information. A linear regression is a type of regression model that generates a continuous value from a linear combination of input features.
This tutorial uses the bigquery-public-data.ml_datasets.penguins dataset.
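To make "linear combination" concrete, the trained model predicts weight as an intercept plus one learned coefficient per input feature, roughly of the following form (an illustrative sketch, not the actual coefficients the model learns; categorical features such as species and sex are one-hot encoded before they enter the combination):

predicted_body_mass_g = b0 + b1 * culmen_length_mm + b2 * culmen_depth_mm + b3 * flipper_length_mm + ...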
Objectives
In this tutorial, you will perform the following tasks:
- Create a linear regression model.
- Evaluate the model.
- Make predictions by using the model.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For more information on BigQuery costs, see the BigQuery pricing page.
For more information on BigQuery ML costs, see BigQuery ML pricing.
Before you begin
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
Enable the BigQuery API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Required permissions
To create the model using BigQuery ML, you need the following IAM permissions:
- bigquery.jobs.create
- bigquery.models.create
- bigquery.models.getData
- bigquery.models.updateData
- bigquery.models.updateMetadata
To run inference, you need the following permissions:
- bigquery.models.getData on the model
- bigquery.jobs.create
Create a dataset
Create a BigQuery dataset to store your ML model.
Console
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
For Dataset ID, enter bqml_tutorial.
For Location type, select Multi-region, and then select US (multiple regions in United States).
Leave the remaining default settings as they are, and click Create dataset.
bq
To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.
Create a dataset named bqml_tutorial with the data location set to US and a description of "BigQuery ML tutorial dataset.":

bq --location=US mk -d \
    --description "BigQuery ML tutorial dataset." \
    bqml_tutorial

Instead of using the --dataset flag, the command uses the -d shortcut. If you omit -d and --dataset, the command defaults to creating a dataset.
Confirm that the dataset was created:

bq ls
API
Call the datasets.insert method with a defined dataset resource.

{"datasetReference": {"datasetId": "bqml_tutorial"}}
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

import google.cloud.bigquery

bqclient = google.cloud.bigquery.Client()
bqclient.create_dataset("bqml_tutorial", exists_ok=True)

Create the model
Create a linear regression model using the penguins sample dataset for BigQuery.
SQL
You can create a linear regression model by using the CREATE MODEL statement and specifying LINEAR_REG for the model type. Creating the model includes training the model.
The following are useful things to know about the CREATE MODEL statement:
- The input_label_cols option specifies which column in the SELECT statement to use as the label column. Here, the label column is body_mass_g. For linear regression models, the label column must be real-valued, that is, the column values must be real numbers.
- This query's SELECT statement uses the following columns in the bigquery-public-data.ml_datasets.penguins table to predict a penguin's weight:
  - species: the species of penguin.
  - island: the island that the penguin resides on.
  - culmen_length_mm: the length of the penguin's culmen in millimeters.
  - culmen_depth_mm: the depth of the penguin's culmen in millimeters.
  - flipper_length_mm: the length of the penguin's flippers in millimeters.
  - sex: the sex of the penguin.
- The WHERE clause in this query's SELECT statement, WHERE body_mass_g IS NOT NULL, excludes rows where the body_mass_g column is NULL.
Run the query that creates your linear regression model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following query:
CREATE OR REPLACE MODEL `bqml_tutorial.penguins_model`
  OPTIONS (
    model_type = 'linear_reg',
    input_label_cols = ['body_mass_g']) AS
SELECT
  *
FROM
  `bigquery-public-data.ml_datasets.penguins`
WHERE
  body_mass_g IS NOT NULL;
It takes about 30 seconds to create the penguins_model model. To see the model, follow these steps:
In the left pane, click Explorer.
If you don't see the left pane, click Expand left pane to open the pane.
In the Explorer pane, expand your project and click Datasets.
Click the bqml_tutorial dataset.
Click the Models tab.
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

from bigframes.ml.linear_model import LinearRegression
import bigframes.pandas as bpd

# Load data from BigQuery
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Drop rows with nulls to get training data
training_data = bq_df.dropna(subset=["body_mass_g"])

# Specify your feature (or input) columns and the label (or output) column:
feature_columns = training_data.drop(columns=["body_mass_g"])
label_columns = training_data[["body_mass_g"]]

# Create the linear model
model = LinearRegression()
model.fit(feature_columns, label_columns)

model.to_gbq(
    your_model_id,  # For example: "bqml_tutorial.penguins_model"
    replace=True,
)

It takes about 30 seconds to create the model. To see the model, follow these steps:
In the left pane, click Explorer.
In the Explorer pane, expand your project and click Datasets.
Click the bqml_tutorial dataset.
Click the Models tab.
Get training statistics
To see the results of the model training, you can use the ML.TRAINING_INFO function, or you can view the statistics in the Google Cloud console. In this tutorial, you use the Google Cloud console.
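If you want the training statistics as query results instead of in the console (not a tutorial step, but useful for scripting), you can call the ML.TRAINING_INFO function mentioned above directly on the model you created:

SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `bqml_tutorial.penguins_model`);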
A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.
Loss is the penalty for a bad prediction. It is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
See the model training statistics that were generated when you ran the CREATE MODEL query:
In the left pane, click Explorer.
In the Explorer pane, expand your project and click Datasets.
Click the bqml_tutorial dataset.
Click the Models tab.
To open the model information pane, click penguins_model.
Click the Training tab, and then click Table. The results should look similar to the following:

The Training Data Loss column represents the loss metric calculated after the model is trained on the training dataset. Since you performed a linear regression, this column shows the mean squared error value. A normal_equation optimization strategy is automatically used for this training, so only one iteration is required to converge to the final model. For more information on setting the model optimization strategy, see optimize_strategy.
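As a reminder, mean squared error is the standard loss for linear regression: the average of the squared differences between each actual and predicted label value over the n evaluated rows:

MSE = (1 / n) * Σ (actual_i − predicted_i)²

A lower value means predictions that are, on average, closer to the observed body mass.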
Evaluate the model
After creating the model, evaluate the model's performance by using the ML.EVALUATE function or the score BigQuery DataFrames function to evaluate the predicted values generated by the model against the actual data.
SQL
For input, the ML.EVALUATE function takes the trained model and a dataset that matches the schema of the data that you used to train the model. In a production environment, you should evaluate the model on different data than the data you used to train the model. If you run ML.EVALUATE without providing input data, the function retrieves the evaluation metrics calculated during training. These metrics are calculated by using the automatically reserved evaluation dataset:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.penguins_model`);

Run the ML.EVALUATE query:
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following query:
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.penguins_model`,
    (
      SELECT
        *
      FROM
        `bigquery-public-data.ml_datasets.penguins`
      WHERE
        body_mass_g IS NOT NULL));
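The preceding query evaluates the model on the same rows it was trained on, which is fine for this tutorial but, as noted above, not what you would do in production. The following sketch shows one way to hold data out with a deterministic hash split; the penguins_model_split model name and the choice of culmen_length_mm as the split key are illustrative assumptions, not part of the tutorial:

-- Train on roughly 80% of rows (rows with a NULL split key are skipped).
CREATE OR REPLACE MODEL `bqml_tutorial.penguins_model_split`
  OPTIONS (
    model_type = 'linear_reg',
    input_label_cols = ['body_mass_g']) AS
SELECT
  *
FROM
  `bigquery-public-data.ml_datasets.penguins`
WHERE
  body_mass_g IS NOT NULL
  AND MOD(ABS(FARM_FINGERPRINT(CAST(culmen_length_mm AS STRING))), 10) < 8;

-- Evaluate on the remaining ~20% of rows, which the model never saw during training.
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.penguins_model_split`,
    (
      SELECT
        *
      FROM
        `bigquery-public-data.ml_datasets.penguins`
      WHERE
        body_mass_g IS NOT NULL
        AND MOD(ABS(FARM_FINGERPRINT(CAST(culmen_length_mm AS STRING))), 10) >= 8));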
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

import bigframes.pandas as bpd

# Select the model you will be evaluating. `read_gbq_model` loads model data from
# BigQuery, but you could also use the `model` object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.penguins_model"
)

# Score the model with input data defined in an earlier step to compare
# model predictions on feature_columns to true labels in label_columns.
score = model.score(feature_columns, label_columns)

# Expected output results:
#   index  mean_absolute_error  mean_squared_error  mean_squared_log_error  median_absolute_error  r2_score  explained_variance
#   0      227.012237           81838.159892        0.00507                 173.080816             0.872377  0.872377
#   1 rows x 6 columns

The results should look similar to the following:

Because you performed a linear regression, the results include the following columns:
- mean_absolute_error
- mean_squared_error
- mean_squared_log_error
- median_absolute_error
- r2_score
- explained_variance
An important metric in the evaluation results is the R2 score. The R2 score is a statistical measure that determines if the linear regression predictions approximate the actual data. A value of 0 indicates that the model explains none of the variability of the response data around the mean. A value of 1 indicates that the model explains all the variability of the response data around the mean.
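For reference, the usual definition of the R2 score (the general formula, not restated from the BigQuery ML documentation) compares the model's squared errors to the total variance of the label:

R² = 1 − Σ (actual_i − predicted_i)² / Σ (actual_i − mean(actual))²

An r2_score of about 0.87, as in the example output above, means the model accounts for roughly 87% of the variance in penguin body mass.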
You can also look at the model's information pane in the Google Cloud console to view the evaluation metrics:

Use the model to predict outcomes
Now that you have evaluated your model, the next step is to use it to predict an outcome. You can run the ML.PREDICT function or the predict BigQuery DataFrames function on the model to predict the body mass in grams of all penguins that reside on the Biscoe Islands.
SQL
For input, the ML.PREDICT function takes the trained model and a dataset that matches the schema of the data that you used to train the model, excluding the label column.
Run the ML.PREDICT query:
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following query:
SELECT
  *
FROM
  ML.PREDICT(MODEL `bqml_tutorial.penguins_model`,
    (
      SELECT
        *
      FROM
        `bigquery-public-data.ml_datasets.penguins`
      WHERE
        island = 'Biscoe'));
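The output contains every input column plus a predicted_body_mass_g column (ML.PREDICT adds a predicted_ prefix to the label column name). You can treat the ML.PREDICT output like any other table; for example, this variation (not a tutorial step) averages the predicted weight per species on Biscoe:

SELECT
  species,
  COUNT(*) AS penguins,
  ROUND(AVG(predicted_body_mass_g), 1) AS avg_predicted_body_mass_g
FROM
  ML.PREDICT(MODEL `bqml_tutorial.penguins_model`,
    (
      SELECT
        *
      FROM
        `bigquery-public-data.ml_datasets.penguins`
      WHERE
        island = 'Biscoe'))
GROUP BY
  species;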
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

# Select the model you'll use for predictions. `read_gbq_model` loads
# model data from BigQuery, but you could also use the `model` object
# from previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.penguins_model"
)

# Load data from BigQuery
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Use 'contains' function to filter by island containing the string
# "Biscoe".
biscoe_data = bq_df.loc[bq_df["island"].str.contains("Biscoe")]

result = model.predict(biscoe_data)

# Expected output results:
#       predicted_body_mass_g                            species  island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g   sex
# 23              4681.782896  Gentoo penguin (Pygoscelis papua)  Biscoe              <NA>             <NA>               <NA>         <NA>  <NA>
# 332             4740.7907    Gentoo penguin (Pygoscelis papua)  Biscoe              46.2             14.4              214.0       4650.0  <NA>
# 160             4731.310452  Gentoo penguin (Pygoscelis papua)  Biscoe              44.5             14.3              216.0       4100.0  <NA>

The results should look similar to the following:

Explain the prediction results
SQL
To understand why the model is generating these prediction results, you can use the ML.EXPLAIN_PREDICT function.
ML.EXPLAIN_PREDICT is an extended version of the ML.PREDICT function. ML.EXPLAIN_PREDICT not only outputs prediction results, but also outputs additional columns to explain the prediction results. In practice, you can run ML.EXPLAIN_PREDICT instead of ML.PREDICT. For more information, see BigQuery ML explainable AI overview.
Run the ML.EXPLAIN_PREDICT query:
- In the Google Cloud console, go to the BigQuery page.
- In the query editor, run the following query:
SELECT
  *
FROM
  ML.EXPLAIN_PREDICT(MODEL `bqml_tutorial.penguins_model`,
    (
      SELECT
        *
      FROM
        `bigquery-public-data.ml_datasets.penguins`
      WHERE
        island = 'Biscoe'),
    STRUCT(3 AS top_k_features));
The results should look similar to the following:

The ML.EXPLAIN_PREDICT query outputs all the input feature columns, similar to what ML.PREDICT does. For readability purposes, only one feature column, species, is shown in the preceding figure.

BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

# Use the 'predict_explain' function to understand why the model is generating these
# prediction results. 'predict_explain' is an extended version of the 'predict' function
# that not only outputs prediction results, but also outputs additional columns to
# explain the prediction results.
# Using the trained model and data specific to Biscoe Island, explain the predictions
# of the top 3 features.
explained = model.predict_explain(biscoe_data, top_k_features=3)

# Expected results:
#    predicted_body_mass_g                           top_feature_attributions  baseline_prediction_value  prediction_value  approximation_error                              species  island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
# 0            5413.510134  [{'feature': 'island', 'attribution': 7348.877...               -5320.222128       5413.510134                  0.0    Gentoo penguin (Pygoscelis papua)  Biscoe              45.2             16.4              223.0       5950.0    MALE
# 1            4768.351092  [{'feature': 'island', 'attribution': 7348.877...               -5320.222128       4768.351092                  0.0    Gentoo penguin (Pygoscelis papua)  Biscoe              46.5             14.5              213.0       4400.0  FEMALE
# 2            3235.896372  [{'feature': 'island', 'attribution': 7348.877...               -5320.222128       3235.896372                  0.0  Adelie Penguin (Pygoscelis adeliae)  Biscoe              37.7             16.0              183.0       3075.0  FEMALE
# 3            5349.603734  [{'feature': 'island', 'attribution': 7348.877...               -5320.222128       5349.603734                  0.0    Gentoo penguin (Pygoscelis papua)  Biscoe              46.4             15.6              221.0       5000.0    MALE
# 4            4637.165037  [{'feature': 'island', 'attribution': 7348.877...               -5320.222128       4637.165037                  0.0    Gentoo penguin (Pygoscelis papua)  Biscoe              46.1             13.2              211.0       4500.0  FEMALE

For linear regression models, Shapley values are used to generate feature attribution values for each feature in the model. The output includes the top three feature attributions per row of the penguins table because top_k_features was set to 3. These attributions are sorted by the absolute value of the attribution in descending order. In all of these examples, the feature island contributed the most to the overall prediction.
Globally explain the model
SQL
To know which features are generally the most important for determining penguin weight, you can use the ML.GLOBAL_EXPLAIN function. In order to use ML.GLOBAL_EXPLAIN, you must retrain the model with the ENABLE_GLOBAL_EXPLAIN option set to TRUE.
Retrain and get global explanations for the model:
- In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following query to retrain the model:
#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.penguins_model`
  OPTIONS (
    model_type = 'linear_reg',
    input_label_cols = ['body_mass_g'],
    enable_global_explain = TRUE) AS
SELECT
  *
FROM
  `bigquery-public-data.ml_datasets.penguins`
WHERE
  body_mass_g IS NOT NULL;
In the query editor, run the following query to get global explanations:
SELECT
  *
FROM
  ML.GLOBAL_EXPLAIN(MODEL `bqml_tutorial.penguins_model`)
The results should look similar to the following:

BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

# To use the `global_explain()` function, the model must be recreated with
# `enable_global_explain` set to `True`.
model = LinearRegression(enable_global_explain=True)

# The model must then be fitted before it can be saved to BigQuery and then explained.
training_data = bq_df.dropna(subset=["body_mass_g"])
X = training_data.drop(columns=["body_mass_g"])
y = training_data[["body_mass_g"]]
model.fit(X, y)
model.to_gbq("bqml_tutorial.penguins_model", replace=True)

# Explain the model
explain_model = model.global_explain()

# Expected results:
#                    attribution
# feature
# island             5737.315921
# species            4073.280549
# sex                 622.070896
# flipper_length_mm   193.612051
# culmen_depth_mm     117.084944
# culmen_length_mm     94.366793

Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.
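If you prefer SQL over the console, a single DDL statement should also remove the dataset and the resources in it, including the model; the CASCADE option is what removes the dataset's contents along with the dataset:

DROP SCHEMA IF EXISTS `bqml_tutorial` CASCADE;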
Delete your project
To delete the project, go to the Manage resources page in the Google Cloud console, select the project, and then click Delete.
What's next
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- For information on creating models, see the CREATE MODEL syntax page.