Perform classification with a boosted trees model
This tutorial teaches you how to use a boosted trees classifier model to predict the income range of individuals based on their demographic data. The model predicts whether a value falls into one of two categories, in this case whether an individual's annual income falls above or below $50,000.
This tutorial uses the bigquery-public-data.ml_datasets.census_adult_income dataset. This dataset contains the demographic and income information of US residents from 2000 and 2010.
Objectives
This tutorial guides you through completing the following tasks:
- Creating a boosted trees model to predict census respondents' income bracket by using the CREATE MODEL statement.
- Evaluating the model by using the ML.EVALUATE function.
- Getting predictions from the model by using the ML.PREDICT function.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
For more information about BigQuery ML costs, see BigQuery ML pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Required Permissions
To create the dataset, you need the bigquery.datasets.create IAM permission.
To create the model, you need the following permissions:
- bigquery.jobs.create
- bigquery.models.create
- bigquery.models.getData
- bigquery.models.updateData
To run inference, you need the following permissions:
- bigquery.models.getData
- bigquery.jobs.create
For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.
Create a dataset
Create a BigQuery dataset to store your ML model.
Console
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
For Dataset ID, enter bqml_tutorial.
For Location type, select Multi-region, and then select US (multiple regions in United States).
Leave the remaining default settings as they are, and click Create dataset.
bq
To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.
Create a dataset named bqml_tutorial with the data location set to US and a description of BigQuery ML tutorial dataset:

    bq --location=US mk -d \
      --description "BigQuery ML tutorial dataset." \
      bqml_tutorial

Instead of using the --dataset flag, the command uses the -d shortcut. If you omit -d and --dataset, the command defaults to creating a dataset.
Confirm that the dataset was created:

    bq ls
API
Call the datasets.insert method with a defined dataset resource.

    {
      "datasetReference": {
        "datasetId": "bqml_tutorial"
      }
    }
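For orientation, the request behind the datasets.insert call can be sketched in Python as follows. This is a minimal sketch rather than an official client: the helper name build_dataset_insert_request and the project ID your-project-id are hypothetical, and the location field is an assumption added to mirror the US location used in the console and bq steps. The sketch only builds the URL and JSON body; it does not send the request or handle authentication.

```python
import json


def build_dataset_insert_request(project_id, dataset_id, location="US"):
    """Return the (url, json_body) pair for a datasets.insert REST call.

    Hypothetical helper for illustration: it builds the request but does not send it.
    """
    url = (
        "https://bigquery.googleapis.com/bigquery/v2/"
        f"projects/{project_id}/datasets"
    )
    body = {
        "datasetReference": {"datasetId": dataset_id},
        # Assumption: use the same US multi-region as the console and bq steps.
        "location": location,
    }
    return url, json.dumps(body)


# "your-project-id" is a placeholder, not a real project.
url, body = build_dataset_insert_request("your-project-id", "bqml_tutorial")
print(url)
print(body)
```

The returned body would be sent as the payload of an authenticated POST to the returned URL.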
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

    import google.cloud.bigquery

    bqclient = google.cloud.bigquery.Client()
    bqclient.create_dataset("bqml_tutorial", exists_ok=True)

Prepare the sample data
The model you create in this tutorial predicts the income bracket for census respondents, based on the following features:
- Age
- Type of work performed
- Marital status
- Level of education
- Occupation
- Hours worked per week
The education column isn't included in the training data, because the education and education_num columns both express the respondent's level of education in different formats.
You separate the data into training, evaluation, and prediction sets by creating a new dataframe column that is derived from the functional_weight column. Eighty percent of the data is used for training the model, and the remaining twenty percent of the data is used for evaluation and prediction.
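The split rule can be sketched in plain Python. This is an illustrative sketch, not part of the tutorial's code: the function name assign_split and the sample weights 0 through 99 are hypothetical, and the thresholds follow the rule used by the queries in this tutorial (last digit 0 to 7 for training, 8 for evaluation, 9 for prediction).

```python
from collections import Counter


def assign_split(functional_weight):
    """Map a functional_weight value to 'training', 'evaluation', or 'prediction'."""
    remainder = functional_weight % 10
    if remainder < 8:
        return "training"
    if remainder == 8:
        return "evaluation"
    return "prediction"  # remainder == 9


# Hypothetical weights 0..99, in which every last digit occurs equally often.
split_counts = Counter(assign_split(weight) for weight in range(100))
print(split_counts)
```

With these evenly spread sample weights, the counts are 80 training, 10 evaluation, and 10 prediction rows. On the real census data the proportions are only approximately 80/10/10, because the last digits of functional_weight are not guaranteed to be uniformly distributed.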
SQL
To prepare your sample data, create a view to contain the training data. This view is used by the CREATE MODEL statement later in this tutorial.
Run the query that prepares the sample data:
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following query:

    CREATE OR REPLACE VIEW
      `bqml_tutorial.input_data` AS
    SELECT
      age,
      workclass,
      marital_status,
      education_num,
      occupation,
      hours_per_week,
      income_bracket,
      CASE
        WHEN MOD(functional_weight, 10) < 8 THEN 'training'
        WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
        WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
      END AS dataframe
    FROM
      `bigquery-public-data.ml_datasets.census_adult_income`;

In the left pane, click Explorer.
If you don't see the left pane, click Expand left pane to open the pane.
In the Explorer pane, search for the bqml_tutorial dataset.
Click the dataset, and then click Overview > Tables.
Click the input_data view to open the information pane. The view schema appears in the Schema tab.
BigQuery DataFrames
Create a DataFrame called input_data. You use input_data later in this tutorial to train the model, evaluate it, and make predictions.

    import bigframes.pandas as bpd

    input_data = bpd.read_gbq(
        "bigquery-public-data.ml_datasets.census_adult_income",
        columns=(
            "age",
            "workclass",
            "marital_status",
            "education_num",
            "occupation",
            "hours_per_week",
            "income_bracket",
            "functional_weight",
        ),
    )
    # Default every row to "training", then relabel rows whose weight ends in 8 or 9.
    input_data["dataframe"] = bpd.Series(
        "training",
        index=input_data.index,
    ).case_when(
        [
            (((input_data["functional_weight"] % 10) == 8), "evaluation"),
            (((input_data["functional_weight"] % 10) == 9), "prediction"),
        ]
    )
    del input_data["functional_weight"]

Create the boosted trees model
Create a boosted trees model to predict census respondents' income bracket, and train it on the census data. The query takes about 30 minutes to complete.
SQL
Follow these steps to create the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:

    CREATE MODEL
      `bqml_tutorial.tree_model`
    OPTIONS (
      MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
      BOOSTER_TYPE = 'GBTREE',
      NUM_PARALLEL_TREE = 1,
      MAX_ITERATIONS = 50,
      TREE_METHOD = 'HIST',
      EARLY_STOP = FALSE,
      SUBSAMPLE = 0.85,
      INPUT_LABEL_COLS = ['income_bracket'])
    AS
    SELECT
      * EXCEPT (dataframe)
    FROM
      `bqml_tutorial.input_data`
    WHERE
      dataframe = 'training';

After the query completes, the tree_model model can be accessed through the Explorer pane. Because the query uses a CREATE MODEL statement to create a model, you don't see query results.
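If you want to experiment with different option values (for example, fewer iterations for a faster run), one possible approach is to generate the statement from a Python dictionary. The following is a minimal sketch under stated assumptions: render_option and build_create_model_sql are hypothetical helpers, not part of any BigQuery library, and the sketch only builds the SQL text without running it or escaping quotes inside string values.

```python
def render_option(value):
    """Render a Python value as a SQL literal for an OPTIONS clause."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return f"'{value}'"
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(render_option(item) for item in value) + "]"
    raise TypeError(f"Unsupported option value: {value!r}")


def build_create_model_sql(model_name, source_view, options):
    """Assemble a CREATE MODEL statement that trains on the 'training' rows."""
    rendered = ",\n  ".join(
        f"{name.upper()} = {render_option(value)}" for name, value in options.items()
    )
    return (
        f"CREATE MODEL `{model_name}`\n"
        f"OPTIONS (\n  {rendered}\n)\n"
        "AS\n"
        "SELECT * EXCEPT (dataframe)\n"
        f"FROM `{source_view}`\n"
        "WHERE dataframe = 'training';"
    )


# The same option values as the tutorial query.
options = {
    "model_type": "BOOSTED_TREE_CLASSIFIER",
    "booster_type": "GBTREE",
    "num_parallel_tree": 1,
    "max_iterations": 50,
    "tree_method": "HIST",
    "early_stop": False,
    "subsample": 0.85,
    "input_label_cols": ["income_bracket"],
}
sql = build_create_model_sql(
    "bqml_tutorial.tree_model", "bqml_tutorial.input_data", options
)
print(sql)
```

The printed statement can be pasted into the query editor in the same way as the query above.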
BigQuery DataFrames

    from bigframes.ml import ensemble

    # input_data is defined in an earlier step.
    training_data = input_data[input_data["dataframe"] == "training"]
    X = training_data.drop(columns=["income_bracket", "dataframe"])
    y = training_data["income_bracket"]

    # Create and train the model.
    tree_model = ensemble.XGBClassifier(
        n_estimators=1,
        booster="gbtree",
        tree_method="hist",
        max_iterations=1,  # For a more accurate model, try 50 iterations.
        subsample=0.85,
    )

    tree_model.fit(X, y)

    tree_model.to_gbq(
        your_model_id,  # For example: "your-project.bqml_tutorial.tree_model"
        replace=True,
    )

Evaluate the model
SQL
Follow these steps to evaluate the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:

    SELECT
      *
    FROM
      ML.EVALUATE(
        MODEL `bqml_tutorial.tree_model`,
        (
          SELECT
            *
          FROM
            `bqml_tutorial.input_data`
          WHERE
            dataframe = 'evaluation'
        ));

The results should look similar to the following:

    +---------------------+---------------------+---------------------+-------------------+---------------------+---------------------+
    | precision           | recall              | accuracy            | f1_score          | log_loss            | roc_auc             |
    +---------------------+---------------------+---------------------+-------------------+---------------------+---------------------+
    | 0.67192429022082023 | 0.57880434782608692 | 0.83942963422194672 | 0.621897810218978 | 0.34405456040833338 | 0.88733566433566435 |
    +---------------------+---------------------+---------------------+-------------------+---------------------+---------------------+
BigQuery DataFrames

    # Select model you'll use for predictions. `read_gbq_model` loads model
    # data from BigQuery, but you could also use the `tree_model` object
    # from the previous step.
    tree_model = bpd.read_gbq_model(
        your_model_id,  # For example: "your-project.bqml_tutorial.tree_model"
    )

    # input_data is defined in an earlier step.
    evaluation_data = input_data[input_data["dataframe"] == "evaluation"]
    X = evaluation_data.drop(columns=["income_bracket", "dataframe"])
    y = evaluation_data["income_bracket"]

    # The score() method evaluates how the model performs compared to the
    # actual data. Output DataFrame matches that of ML.EVALUATE().
    score = tree_model.score(X, y)
    score.peek()
    # Output:
    #    precision    recall  accuracy  f1_score  log_loss   roc_auc
    # 0   0.671924  0.578804  0.839429  0.621897  0.344054  0.887335

The evaluation metrics indicate good model performance, in particular, the fact that the roc_auc score is greater than 0.8.
For more information about the evaluation metrics, see Output.
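To make the relationship between these metrics concrete, the following sketch computes precision, recall, accuracy, and F1 from confusion-matrix counts using their standard definitions. It is an illustration, not BigQuery ML's implementation: the function names are hypothetical, and the counts passed to binary_metrics are made up for the example. Only the precision and recall values passed to f1_score come from the sample output in this tutorial.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1_score": f1_score(precision, recall),
    }


# Hypothetical counts, for illustration only (not taken from the census data).
example = binary_metrics(tp=8, fp=2, fn=4, tn=86)
print(example)

# Precision and recall copied from the sample evaluation output in this tutorial.
sample_f1 = f1_score(0.67192429022082023, 0.57880434782608692)
print(sample_f1)
```

The value of sample_f1 agrees with the f1_score column in the sample output (approximately 0.6219), which shows that the reported F1 score is the harmonic mean of the reported precision and recall.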
Use the model to predict classifications
SQL
Follow these steps to get predictions from the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:

    SELECT
      *
    FROM
      ML.PREDICT(
        MODEL `bqml_tutorial.tree_model`,
        (
          SELECT
            *
          FROM
            `bqml_tutorial.input_data`
          WHERE
            dataframe = 'prediction'
        ));

The first few columns of the results should look similar to the following:

    +--------------------------+--------------------------------------+-------------------------------------+
    | predicted_income_bracket | predicted_income_bracket_probs.label | predicted_income_bracket_probs.prob |
    +--------------------------+--------------------------------------+-------------------------------------+
    | <=50K                    | >50K                                 | 0.05183430016040802                 |
    +--------------------------+--------------------------------------+-------------------------------------+
    |                          | <50K                                 | 0.94816571474075317                 |
    +--------------------------+--------------------------------------+-------------------------------------+
    | <=50K                    | >50K                                 | 0.00365859130397439                 |
    +--------------------------+--------------------------------------+-------------------------------------+
    |                          | <50K                                 | 0.99634140729904175                 |
    +--------------------------+--------------------------------------+-------------------------------------+
    | <=50K                    | >50K                                 | 0.037775970995426178                |
    +--------------------------+--------------------------------------+-------------------------------------+
    |                          | <50K                                 | 0.96222406625747681                 |
    +--------------------------+--------------------------------------+-------------------------------------+
BigQuery DataFrames

    # Select model you'll use for predictions. `read_gbq_model` loads model
    # data from BigQuery, but you could also use the `tree_model` object
    # from previous steps.
    tree_model = bpd.read_gbq_model(
        your_model_id,  # For example: "your-project.bqml_tutorial.tree_model"
    )

    # input_data is defined in an earlier step.
    prediction_data = input_data[input_data["dataframe"] == "prediction"]

    predictions = tree_model.predict(prediction_data)
    predictions.peek()
    # Output:
    # predicted_income_bracket  predicted_income_bracket_probs.label  predicted_income_bracket_probs.prob
    # <=50K                     >50K                                  0.05183430016040802
    #                           <50K                                  0.94816571474075317
    # <=50K                     >50K                                  0.00365859130397439
    #                           <50K                                  0.99634140729904175
    # <=50K                     >50K                                  0.037775970995426178
    #                           <50K                                  0.96222406625747681

The predicted_income_bracket column contains the predicted value from the model. The predicted_income_bracket_probs.label column shows the two labels that the model had to choose between, and the predicted_income_bracket_probs.prob column shows the probability of the given label being the correct one.
For more information about the output columns, see Classification models.
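The relationship between the probability columns and the predicted value can be sketched as follows. This is an illustration under stated assumptions, not BigQuery ML's implementation: the helper most_probable_label is hypothetical, the row is written as a list of label and probability pairs, the label spelling <=50K is assumed to match the predicted_income_bracket column, and choosing the most probable label is assumed to correspond to the default behavior for a two-class model.

```python
def most_probable_label(probs):
    """Return the label with the highest probability in one prediction row."""
    return max(probs, key=lambda entry: entry["prob"])["label"]


# Probabilities from the first sample row; the "<=50K" spelling is an assumption.
row_probs = [
    {"label": ">50K", "prob": 0.05183430016040802},
    {"label": "<=50K", "prob": 0.94816571474075317},
]

predicted = most_probable_label(row_probs)
total = sum(entry["prob"] for entry in row_probs)
print(predicted)
print(total)
```

For this row, the most probable label is the one reported in predicted_income_bracket, and the two probabilities sum to approximately 1.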
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.
Delete your project
To delete the project:
What's next
- Learn how to create a logistic regression classification model.
- For an overview of BigQuery ML, see Introduction to AI and ML in BigQuery.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.