Create a machine learning model in BigQuery ML by using SQL
This tutorial shows you how to create a logistic regression model by using BigQuery ML SQL queries.
BigQuery ML lets you create and train machine learning models in BigQuery by using SQL queries. This helps make machine learning more approachable by letting you use familiar tools like the BigQuery SQL editor, and also increases development speed by removing the need to move data into a separate machine learning environment.
In this tutorial, you use the Google Analytics sample dataset for BigQuery to create a model that predicts whether a website visitor will make a transaction. For information on the schema of the Analytics dataset, see BigQuery export schema in the Analytics Help Center.
To learn how to create models by using the Google Cloud console user interface, see Work with models by using a UI.
Objectives
This tutorial shows you how to perform the following tasks:
- Use the `CREATE MODEL` statement to create a binary logistic regression model.
- Use the `ML.EVALUATE` function to evaluate the model.
- Use the `ML.PREDICT` function to make predictions by using the model.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For more information on BigQuery costs, see the BigQuery pricing page.
For more information on BigQuery ML costs, see BigQuery ML pricing.
Required roles
To create a model and run inference, you must be granted the following roles:
- BigQuery Data Editor (`roles/bigquery.dataEditor`)
- BigQuery User (`roles/bigquery.user`)
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Roles required to select or create a project:
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (`roles/resourcemanager.projectCreator`), which contains the `resourcemanager.projects.create` permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
Make sure that you have the following role or roles on the project: BigQuery Data Editor, BigQuery Job User, Service Usage Admin
Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- Click Select a role, then search for the role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (`roles/serviceusage.serviceUsageAdmin`), which contains the `serviceusage.services.enable` permission. Learn how to grant roles.
Create a dataset
Create a BigQuery dataset to store your ML model.
Console
In the Google Cloud console, go to the BigQuery page.

In the Explorer pane, click your project name.

Click View actions > Create dataset.

On the Create dataset page, do the following:

- For Dataset ID, enter `bqml_tutorial`.
- For Location type, select Multi-region, and then select US (multiple regions in United States).
- Leave the remaining default settings as they are, and click Create dataset.
bq
To create a new dataset, use the `bq mk` command with the `--location` flag. For a full list of possible parameters, see the `bq mk --dataset` command reference.
Create a dataset named `bqml_tutorial` with the data location set to `US` and a description of `BigQuery ML tutorial dataset`:

```
bq --location=US mk -d \
    --description "BigQuery ML tutorial dataset." \
    bqml_tutorial
```

Instead of using the `--dataset` flag, the command uses the `-d` shortcut. If you omit `-d` and `--dataset`, the command defaults to creating a dataset.

Confirm that the dataset was created:

```
bq ls
```
API
Call the `datasets.insert` method with a defined dataset resource.

```json
{
  "datasetReference": {
    "datasetId": "bqml_tutorial"
  }
}
```
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in theBigQuery quickstart using BigQuery DataFrames. For more information, see theBigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, seeSet up ADC for a local development environment.
```python
import google.cloud.bigquery

bqclient = google.cloud.bigquery.Client()
bqclient.create_dataset("bqml_tutorial", exists_ok=True)
```

Create a logistic regression model
Create a logistic regression model using the Analytics sample dataset for BigQuery.
SQL
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following statement:
```sql
CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
```
The query takes several minutes to complete. After the first iteration is complete, your model (`sample_model`) appears in the navigation panel. Because the query uses a `CREATE MODEL` statement to create a model, you don't see query results.
Query details
The `CREATE MODEL` statement creates the model and then trains the model using the data retrieved by your query's `SELECT` statement.
The `OPTIONS(model_type='logistic_reg')` clause creates a logistic regression model. A logistic regression model splits input data into two classes, and then estimates the probability that the data is in one of the classes. What you are trying to detect, such as whether an email is spam, is represented by 1 and other values are represented by 0. The likelihood of a given value belonging to the class you are trying to detect is indicated by a value between 0 and 1. For example, if an email receives a probability estimate of 0.9, then there is a 90% probability that the email is spam.
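That probability estimate comes from applying the logistic (sigmoid) function to a weighted sum of the model's inputs. The following is a minimal sketch of that idea; the single `pageviews` feature and the weight and bias values are invented for illustration, not the tutorial model's actual parameters:

```python
import math

def predict_probability(features, weights, bias):
    """Apply the logistic (sigmoid) function to a weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: more pageviews push the estimate toward class 1.
prob = predict_probability(features=[12], weights=[0.8], bias=-5.0)
print(round(prob, 3))  # a value between 0 and 1; here, close to 1
```

A session with few page views would instead produce a value near 0, placing it in the other class.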
This query's `SELECT` statement retrieves the following columns that are used by the model to predict the probability that a customer will complete a transaction:
- `totals.transactions`: the total number of ecommerce transactions within the session. If the number of transactions is `NULL`, the value in the `label` column is set to `0`. Otherwise, it is set to `1`. These values represent the possible outcomes. Creating an alias named `label` is an alternative to setting the `input_label_cols=` option in the `CREATE MODEL` statement.
- `device.operatingSystem`: the operating system of the visitor's device.
- `device.isMobile`: indicates whether the visitor's device is a mobile device.
- `geoNetwork.country`: the country from which the sessions originated, based on the IP address.
- `totals.pageviews`: the total number of page views within the session.
The `FROM` clause causes the query to train the model by using the `bigquery-public-data.google_analytics_sample.ga_sessions` sample tables. These tables are sharded by date, so you aggregate them by using a wildcard in the table name: `google_analytics_sample.ga_sessions_*`.
The `WHERE` clause, `_TABLE_SUFFIX BETWEEN '20160801' AND '20170630'`, limits the number of tables scanned by the query. The date range scanned is August 1, 2016 to June 30, 2017.
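The label and feature logic of the `SELECT` statement above can be sketched in plain Python over a couple of invented session records. The field names follow the Analytics export schema, but the helper function and sample rows are hypothetical, shown only to make the `IF`/`IFNULL` handling concrete:

```python
def to_training_row(session):
    """Mirror the tutorial query's IF/IFNULL logic for one session record."""
    totals = session.get("totals", {})
    device = session.get("device", {})
    geo = session.get("geoNetwork", {})
    return {
        # IF(totals.transactions IS NULL, 0, 1) AS label
        "label": 0 if totals.get("transactions") is None else 1,
        # IFNULL(device.operatingSystem, "") AS os
        "os": device.get("operatingSystem") or "",
        "is_mobile": device.get("isMobile", False),
        # IFNULL(geoNetwork.country, "") AS country
        "country": geo.get("country") or "",
        # IFNULL(totals.pageviews, 0) AS pageviews
        "pageviews": totals.get("pageviews") or 0,
    }

sessions = [
    {"totals": {"transactions": 2, "pageviews": 12},
     "device": {"operatingSystem": "Android", "isMobile": True},
     "geoNetwork": {"country": "Canada"}},
    {"totals": {"pageviews": None},
     "device": {"isMobile": False},
     "geoNetwork": {}},
]
rows = [to_training_row(s) for s in sessions]
print(rows[0]["label"], rows[1]["label"])  # 1 0
```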
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in theBigQuery quickstart using BigQuery DataFrames. For more information, see theBigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, seeSet up ADC for a local development environment.
```python
from bigframes.ml.linear_model import LogisticRegression
import bigframes.pandas as bpd

# Start by selecting the data you'll use for training. `read_gbq` accepts
# either a SQL query or a table ID. Since this example selects from multiple
# tables via a wildcard, use SQL to define this data. Watch issue
# https://github.com/googleapis/python-bigquery-dataframes/issues/169
# for updates to `read_gbq` to support wildcard tables.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20160801"),
        ("_table_suffix", "<=", "20170630"),
    ],
)

# Extract the total number of transactions within
# the Google Analytics session.
#
# Because the totals column is a STRUCT data type, call
# Series.struct.field("transactions") to extract the transactions field.
# See the reference documentation below:
# https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor#bigframes_operations_structs_StructAccessor_field
transactions = df["totals"].struct.field("transactions")

# The "label" values represent the outcome of the model's
# prediction. In this case, the model predicts if there are any
# ecommerce transactions within the Google Analytics session.
# If the number of transactions is NULL, the value in the label
# column is set to 0. Otherwise, it is set to 1.
label = transactions.notnull().map({True: 1, False: 0}).rename("label")

# Extract the operating system of the visitor's device.
operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")

# Extract whether the visitor's device is a mobile device.
is_mobile = df["device"].struct.field("isMobile")

# Extract the country from which the sessions originated, based on the IP address.
country = df["geoNetwork"].struct.field("country").fillna("")

# Extract the total number of page views within the session.
pageviews = df["totals"].struct.field("pageviews").fillna(0)

# Combine all the feature columns into a single DataFrame
# to use as training data.
features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
    }
)

# A logistic regression model splits data into two classes, giving a
# confidence score that the data is in one of the classes.
model = LogisticRegression()
model.fit(features, label)

# The model.fit() call above created a temporary model.
# Use the to_gbq() method to write to a permanent location.
model.to_gbq(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
    replace=True,
)
```

View the model's loss statistics
Machine learning is about creating a model that can use data to make a prediction. The model is essentially a function that takes inputs and applies calculations to the inputs to produce an output: a prediction.
Machine learning algorithms work by taking several examples where the prediction is already known (such as the historical data of user purchases) and iteratively adjusting various weights in the model so that the model's predictions match the true values. It does this by minimizing how wrong the model is, using a metric called loss.
The expectation is that for each iteration, the loss should be decreasing, ideally to zero. A loss of zero means the model is 100% accurate.
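For this tutorial's logistic regression, the loss metric is log loss. A minimal sketch of the calculation, using invented labels and predicted probabilities, shows that confident, correct predictions produce a lower loss than uncertain ones:

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [1, 0, 0, 1]
confident = [0.95, 0.05, 0.10, 0.90]   # close to the true labels
uncertain = [0.60, 0.40, 0.45, 0.55]   # barely better than a coin flip
print(log_loss(labels, confident) < log_loss(labels, uncertain))  # True
```

A model that predicted every label perfectly (probability 1.0 for every positive, 0.0 for every negative) would have a log loss of essentially zero.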
When training the model, BigQuery ML automatically splits the input data into training and evaluation sets, in order to avoid overfitting the model. This is necessary so that the training algorithm doesn't fit itself so closely to the training data that it can't generalize to new examples.
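Conceptually, this automatic split is similar to reserving a random fraction of the input rows for evaluation. The following is a rough sketch of that idea; the 80/20 fraction and the seed are arbitrary choices for illustration, not what BigQuery ML actually uses internally:

```python
import random

def train_eval_split(rows, eval_fraction=0.2, seed=42):
    """Randomly reserve a fraction of the rows for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

rows = list(range(100))
train, evaluation = train_eval_split(rows)
print(len(train), len(evaluation))  # 80 20
```

The model is trained only on `train`; loss measured on `evaluation` indicates how well it generalizes to rows it has never seen.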
Use the Google Cloud console to see how the model's loss changes over the model's training iterations:
- In the Google Cloud console, go to the BigQuery page.
- In the left pane, click Explorer. If you don't see the left pane, click Expand left pane to open it.
- In the Explorer pane, expand your project, click Datasets, and then click the `bqml_tutorial` dataset.
- Click the Models tab, and then click the `sample_model` model.
- Click the Training tab and look at the Loss graph. The Loss graph shows the change in the loss metric over the iterations on the training dataset. If you hold your cursor over the graph, you can see that there are lines for Training loss and Evaluation loss. Since you performed a logistic regression, the training loss value is calculated as log loss, using the training data. The evaluation loss is the log loss calculated on the evaluation data. Both loss types represent average loss values, averaged over all examples in the respective datasets for each iteration.
You can also see the results of the model training by using the `ML.TRAINING_INFO` function.
Evaluate the model
Evaluate the performance of the model by using the `ML.EVALUATE` function. The `ML.EVALUATE` function evaluates the predicted values generated by the model against the actual data. To calculate logistic regression-specific metrics, you can use the `ML.ROC_CURVE` SQL function or the `bigframes.ml.metrics.roc_curve` BigQuery DataFrames function.
In this tutorial, you are using a binary classification model that detects transactions. The values in the `label` column are the two classes generated by the model: `0` (no transactions) and `1` (transaction made).
SQL
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following statement:
```sql
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, "") AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
```
The results should look like the following:
```
+-------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
| precision         | recall              | accuracy            | f1_score            | log_loss            | roc_auc             |
+-------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
| 0.468503937007874 | 0.11080074487895716 | 0.98534315834767638 | 0.17921686746987953 | 0.04624221101176898 | 0.98174125874125873 |
+-------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
```
Because you performed a logistic regression, the results include the following columns:
- `precision`: a metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class.
- `recall`: a metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify?
- `accuracy`: the fraction of predictions that a classification model got right.
- `f1_score`: a measure of the accuracy of the model. The F1 score is the harmonic average of the precision and recall. An F1 score's best value is 1 and its worst value is 0.
- `log_loss`: the loss function used in a logistic regression. This is the measure of how far the model's predictions are from the correct labels.
- `roc_auc`: the area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course.
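Precision, recall, accuracy, and F1 can all be derived from confusion-matrix counts. A small sketch with invented counts also shows why accuracy looks so high in these results: transactions are rare, so a model can score near-perfect accuracy even with modest recall:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1_score": f1}

# Hypothetical counts for a rare positive class: 1000 sessions,
# only 20 with a transaction, 10 of which the model caught.
m = classification_metrics(tp=10, fp=10, tn=970, fn=10)
print(round(m["accuracy"], 2), round(m["recall"], 2))  # 0.98 0.5
```

Accuracy alone would suggest an excellent model here, while recall reveals that half of the actual transactions were missed, the same pattern visible in the tutorial's evaluation output.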
Query details
The initial `SELECT` statement retrieves the columns from your model.
The `FROM` clause uses the `ML.EVALUATE` function against your model.
The nested `SELECT` statement and `FROM` clause are the same as those in the `CREATE MODEL` query.
The `WHERE` clause, `_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'`, limits the number of tables scanned by the query. The date range scanned is July 1, 2017 to August 1, 2017. This is the data you're using to evaluate the predictive performance of the model. It was collected in the month immediately following the time period spanned by the training data.
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in theBigQuery quickstart using BigQuery DataFrames. For more information, see theBigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, seeSet up ADC for a local development environment.
```python
import bigframes.pandas as bpd

# Select the model you'll use for evaluating. `read_gbq_model` loads model data
# from BigQuery, but you could also use the `model` object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
)

# The filters parameter limits the number of tables scanned by the query.
# The date range scanned is July 1, 2017 to August 1, 2017. This is the
# data you're using to evaluate the predictive performance of the model.
# It was collected in the month immediately following the time period
# spanned by the training data.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ],
)

transactions = df["totals"].struct.field("transactions")
label = transactions.notnull().map({True: 1, False: 0}).rename("label")

operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")
is_mobile = df["device"].struct.field("isMobile")
country = df["geoNetwork"].struct.field("country").fillna("")
pageviews = df["totals"].struct.field("pageviews").fillna(0)

features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
    }
)

# Some models include a convenient .score(X, y) method for evaluation with a
# preset accuracy metric. Because you performed a logistic regression, the
# results include the following columns:
# - precision: identifies the frequency with which the model was correct when
#   predicting the positive class.
# - recall: out of all the possible positive labels, how many did the model
#   correctly identify?
# - accuracy: the fraction of predictions that the model got right.
# - f1_score: the harmonic average of the precision and recall. An F1 score's
#   best value is 1; the worst value is 0.
# - log_loss: the loss function used in a logistic regression; how far the
#   model's predictions are from the correct labels.
# - roc_auc: the area under the ROC curve. For more information, see
#   "Classification" in the Machine Learning Crash Course:
#   https://developers.google.com/machine-learning/crash-course/classification/video-lecture
model.score(features, label)
#    precision    recall  accuracy  f1_score  log_loss   roc_auc
# 0   0.412621  0.079143  0.985074  0.132812  0.049764  0.974285
# [1 rows x 6 columns]
```

Use the model to predict outcomes
Use the model to predict the number of transactions made by website visitors from each country.
SQL
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following statement:
```sql
SELECT
  country,
  SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(totals.pageviews, 0) AS pageviews,
      IFNULL(geoNetwork.country, "") AS country
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY
  country
ORDER BY
  total_predicted_purchases DESC
LIMIT 10
```
The results should look like the following:
```
+----------------+---------------------------+
| country        | total_predicted_purchases |
+----------------+---------------------------+
| United States  | 220                       |
| Taiwan         | 8                         |
| Canada         | 7                         |
| India          | 2                         |
| Turkey         | 2                         |
| Japan          | 2                         |
| Italy          | 1                         |
| Brazil         | 1                         |
| Singapore      | 1                         |
| Australia      | 1                         |
+----------------+---------------------------+
```
Query details
The initial `SELECT` statement retrieves the `country` column and sums the `predicted_label` column. The `predicted_label` column is generated by the `ML.PREDICT` function. When you use the `ML.PREDICT` function, the output column name for the model is `predicted_<label_column_name>`. For linear regression models, `predicted_label` is the estimated value of `label`. For logistic regression models, `predicted_label` is the label that best describes the given input data value, either `0` or `1`.
The `ML.PREDICT` function is used to predict results using your model.
The nested `SELECT` statement and `FROM` clause are the same as those in the `CREATE MODEL` query.
The `WHERE` clause, `_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'`, limits the number of tables scanned by the query. The date range scanned is July 1, 2017 to August 1, 2017. This is the data for which you're making predictions. It was collected in the month immediately following the time period spanned by the training data.
The `GROUP BY` and `ORDER BY` clauses group the results by country and order them by the sum of the predicted purchases in descending order.
The `LIMIT` clause is used here to display only the top 10 results.
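The `GROUP BY`, `ORDER BY`, and `LIMIT` steps can be sketched in plain Python over invented `(country, predicted_label)` pairs of the kind `ML.PREDICT` returns:

```python
from collections import defaultdict

# Hypothetical per-session predictions: (country, predicted_label).
predictions = [
    ("United States", 1), ("United States", 1), ("United States", 0),
    ("Canada", 1), ("Taiwan", 1), ("Taiwan", 1), ("Canada", 0),
]

# SUM(predicted_label) ... GROUP BY country
totals = defaultdict(int)
for country, predicted_label in predictions:
    totals[country] += predicted_label

# ORDER BY total_predicted_purchases DESC LIMIT 10
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top)  # [('United States', 2), ('Taiwan', 2), ('Canada', 1)]
```

Because the logistic regression's `predicted_label` is `0` or `1` per session, summing it per country counts the sessions predicted to contain a transaction.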
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in theBigQuery quickstart using BigQuery DataFrames. For more information, see theBigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, seeSet up ADC for a local development environment.
```python
import bigframes.pandas as bpd

# Select the model you'll use for predicting. `read_gbq_model` loads model data
# from BigQuery, but you could also use the `model` object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
)

# The filters parameter limits the number of tables scanned by the query.
# The date range scanned is July 1, 2017 to August 1, 2017. This is the
# data you're using to make the prediction.
# It was collected in the month immediately following the time period
# spanned by the training data.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ],
)

operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")
is_mobile = df["device"].struct.field("isMobile")
country = df["geoNetwork"].struct.field("country").fillna("")
pageviews = df["totals"].struct.field("pageviews").fillna(0)

features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
    }
)

# Use the logistic regression predict method to predict results
# using your model. Find more information in the BigFrames reference:
# https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.linear_model.LogisticRegression#bigframes_ml_linear_model_LogisticRegression_predict
predictions = model.predict(features)

# Call the groupby method to group predicted_label by country.
# Call the sum method to get the total predicted_label by country.
total_predicted_purchases = predictions.groupby(["country"])[
    ["predicted_label"]
].sum()

# Call the sort_values method with the parameter
# ascending = False to get the highest values.
# Call the head method to limit to the 10 highest values.
total_predicted_purchases.sort_values(ascending=False).head(10)

# country
# United States    220
# Taiwan             8
# Canada             7
# India              2
# Japan              2
# Turkey             2
# Australia          1
# Brazil             1
# Germany            1
# Guyana             1
# Name: predicted_label, dtype: Int64
```

Predict purchases per user
Predict the number of transactions each website visitor will make.
SQL
This query is identical to the query in the previous section except for the `GROUP BY` clause. Here the `GROUP BY` clause, `GROUP BY fullVisitorId`, is used to group the results by visitor ID.
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following statement:
```sql
SELECT
  fullVisitorId,
  SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(totals.pageviews, 0) AS pageviews,
      IFNULL(geoNetwork.country, "") AS country,
      fullVisitorId
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY
  fullVisitorId
ORDER BY
  total_predicted_purchases DESC
LIMIT 10
```
The results should look like the following:
```
+---------------------+---------------------------+
| fullVisitorId       | total_predicted_purchases |
+---------------------+---------------------------+
| 9417857471295131045 | 4                         |
| 112288330928895942  | 2                         |
| 2158257269735455737 | 2                         |
| 489038402765684003  | 2                         |
| 057693500927581077  | 2                         |
| 2969418676126258798 | 2                         |
| 5073919761051630191 | 2                         |
| 7420300501523012460 | 2                         |
| 0456807427403774085 | 2                         |
| 2105122376016897629 | 2                         |
+---------------------+---------------------------+
```
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in theBigQuery quickstart using BigQuery DataFrames. For more information, see theBigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, seeSet up ADC for a local development environment.
```python
import bigframes.pandas as bpd

# Select the model you'll use for predicting. `read_gbq_model` loads model data
# from BigQuery, but you could also use the `model` object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
)

# The filters parameter limits the number of tables scanned by the query.
# The date range scanned is July 1, 2017 to August 1, 2017. This is the
# data you're using to make the prediction.
# It was collected in the month immediately following the time period
# spanned by the training data.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ],
)

operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")
is_mobile = df["device"].struct.field("isMobile")
country = df["geoNetwork"].struct.field("country").fillna("")
pageviews = df["totals"].struct.field("pageviews").fillna(0)
full_visitor_id = df["fullVisitorId"]

features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
        "fullVisitorId": full_visitor_id,
    }
)

predictions = model.predict(features)

# Call the groupby method to group predicted_label by visitor.
# Call the sum method to get the total predicted_label by visitor.
total_predicted_purchases = predictions.groupby(["fullVisitorId"])[
    ["predicted_label"]
].sum()

# Call the sort_values method with the parameter
# ascending = False to get the highest values.
# Call the head method to limit to the 10 highest values.
total_predicted_purchases.sort_values(ascending=False).head(10)

# fullVisitorId
# 9417857471295131045    4
# 0376394056092189113    2
# 0456807427403774085    2
# 057693500927581077     2
# 112288330928895942     2
# 1280993661204347450    2
# 2105122376016897629    2
# 2158257269735455737    2
# 2969418676126258798    2
# 489038402765684003     2
# Name: predicted_label, dtype: Int64
```

Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
You can delete the project you created, or keep the project and delete thedataset.
Delete the dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
- In the Google Cloud console, go to the BigQuery page.
- In the left pane, click Explorer.
- In the Explorer pane, expand your project, click Datasets, and then click the `bqml_tutorial` dataset that you created.
- Click Delete.
- In the Delete dataset dialog, confirm the delete command by typing `delete`.
- Click Delete.
Delete the project
To delete the project:
What's next
- To learn more about machine learning, see the Machine learning crash course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-02-19 UTC.