Create a k-means model to cluster the London bicycle hires dataset
This tutorial teaches you how to use a k-means model in BigQuery ML to identify clusters in a set of data.

The k-means algorithm, which groups your data into clusters, is a form of unsupervised machine learning. Unlike supervised machine learning, which is about predictive analytics, unsupervised machine learning is about descriptive analytics. Unsupervised machine learning can help you understand your data so that you can make data-driven decisions.

The queries in this tutorial use geography functions available in geospatial analytics. For more information, see Introduction to geospatial analytics.

This tutorial uses the London Bicycle Hires public dataset. The data includes start and stop timestamps, station names, and ride duration.
Objectives
This tutorial guides you through completing the following tasks:

- Examine the data used to train the model.
- Create a k-means clustering model.
- Interpret the data clusters produced, using BigQuery ML's visualization of the clusters.
- Run the ML.PREDICT function on the k-means model to predict the likely cluster for a set of bike hire stations.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For information on BigQuery costs, see the BigQuery pricing page.

For information on BigQuery ML costs, see BigQuery ML pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Verify that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Required permissions
To create the dataset, you need the bigquery.datasets.create IAM permission.

To create the model, you need the following permissions:

- bigquery.jobs.create
- bigquery.models.create
- bigquery.models.getData
- bigquery.models.updateData

To run inference, you need the following permissions:

- bigquery.models.getData
- bigquery.jobs.create
For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.
Create a dataset
Create a BigQuery dataset to store your k-means model:
In the Google Cloud console, go to the BigQuery page.
In the left pane, click Explorer:

If you don't see the left pane, click Expand left pane to open the pane.

In the Explorer pane, click your project name.

Click View actions > Create dataset.

On the Create dataset page, do the following:

For Dataset ID, enter bqml_tutorial.

For Location type, select Multi-region, and then select EU (multiple regions in European Union).

The London Bicycle Hires public dataset is stored in the EU multi-region. Your dataset must be in the same location.

Leave the remaining default settings as they are, and click Create dataset.

Examine the training data
Examine the data you will use to train your k-means model. In this tutorial, you cluster bike stations based on the following attributes:
- Duration of rentals
- Number of trips per day
- Distance from city center
SQL
This query extracts data on cycle hires, including the start_station_name and duration columns, and joins this data with station information. This includes creating a calculated column that contains the station's distance from the city center. Then, it computes attributes of each station in a stationstats temporary table, including the average duration of rides and the number of trips, along with the calculated distance_from_city_center column.
Follow these steps to examine the training data:
In the Google Cloud console, go to the BigQuery page.

In the query editor, paste in the following query and click Run:
WITH
  hs AS (
    SELECT
      h.start_station_name AS station_name,
      IF(
        EXTRACT(DAYOFWEEK FROM h.start_date) = 1
          OR EXTRACT(DAYOFWEEK FROM h.start_date) = 7,
        'weekend',
        'weekday') AS isweekday,
      h.duration,
      ST_DISTANCE(
        ST_GEOGPOINT(s.longitude, s.latitude),
        ST_GEOGPOINT(-0.1, 51.5)) / 1000 AS distance_from_city_center
    FROM
      `bigquery-public-data.london_bicycles.cycle_hire` AS h
    JOIN
      `bigquery-public-data.london_bicycles.cycle_stations` AS s
      ON h.start_station_id = s.id
    WHERE
      h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
        AND CAST('2016-01-01 00:00:00' AS TIMESTAMP)
  ),
  stationstats AS (
    SELECT
      station_name,
      isweekday,
      AVG(duration) AS duration,
      COUNT(duration) AS num_trips,
      MAX(distance_from_city_center) AS distance_from_city_center
    FROM hs
    GROUP BY station_name, isweekday
  )
SELECT *
FROM stationstats
ORDER BY distance_from_city_center ASC;
The results should look similar to the following:

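The distance_from_city_center value comes from ST_DISTANCE over geography points, which returns the distance along the Earth's surface in meters (divided by 1000 here to get kilometers). As a rough local illustration only, not the BigQuery implementation, the same quantity can be approximated with the haversine formula; the station coordinates below are hypothetical:

```python
import math


def haversine_km(lon1, lat1, lon2, lat2):
    """Approximate great-circle distance in km, assuming a spherical Earth."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * r * math.asin(math.sqrt(a))


# City center reference used in the tutorial query: longitude -0.1, latitude 51.5.
# A hypothetical station a little west of the center:
print(round(haversine_km(-0.15, 51.503, -0.1, 51.5), 2))  # roughly 3.5 km
```

BigQuery's ST_DISTANCE is more precise (it models the WGS84 spheroid unless told otherwise), but for clustering purposes the two agree closely at city scale.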
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.
import datetime
import typing

import pandas as pd
from shapely.geometry import Point

import bigframes
import bigframes.bigquery as bbq
import bigframes.geopandas
import bigframes.pandas as bpd

bigframes.options.bigquery.project = your_gcp_project_id
# Compute in the EU multi-region to query the London bicycles dataset.
bigframes.options.bigquery.location = "EU"

# Extract the information you'll need to train the k-means model in this
# tutorial. Use the read_gbq function to represent cycle hires
# data as a DataFrame.
h = bpd.read_gbq(
    "bigquery-public-data.london_bicycles.cycle_hire",
    col_order=["start_station_name", "start_station_id", "start_date", "duration"],
).rename(
    columns={
        "start_station_name": "station_name",
        "start_station_id": "station_id",
    }
)

# Use GeoSeries.from_xy and BigQuery.st_distance to analyze geographical
# data. These functions determine spatial relationships between
# geographical features.
cycle_stations = bpd.read_gbq("bigquery-public-data.london_bicycles.cycle_stations")
s = bpd.DataFrame(
    {
        "id": cycle_stations["id"],
        "xy": bigframes.geopandas.GeoSeries.from_xy(
            cycle_stations["longitude"], cycle_stations["latitude"]
        ),
    }
)
s_distance = bbq.st_distance(s["xy"], Point(-0.1, 51.5), use_spheroid=False) / 1000
s = bpd.DataFrame({"id": s["id"], "distance_from_city_center": s_distance})

# Define Python datetime objects in the UTC timezone for range comparison,
# because BigQuery stores timestamp data in the UTC timezone.
sample_time = datetime.datetime(2015, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)
sample_time2 = datetime.datetime(2016, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)

h = h.loc[(h["start_date"] >= sample_time) & (h["start_date"] <= sample_time2)]

# Replace each day-of-the-week number with the corresponding "weekday" or
# "weekend" label by using the Series.map method.
h = h.assign(
    isweekday=h.start_date.dt.dayofweek.map(
        {
            0: "weekday",
            1: "weekday",
            2: "weekday",
            3: "weekday",
            4: "weekday",
            5: "weekend",
            6: "weekend",
        }
    )
)

# Supplement each trip in "h" with the station distance information from
# "s" by merging the two DataFrames by station ID.
merged_df = h.merge(
    right=s,
    how="inner",
    left_on="station_id",
    right_on="id",
)

# Engineer features to cluster the stations. For each station, find the
# average trip duration, number of trips, and distance from city center.
stationstats = typing.cast(
    bpd.DataFrame,
    merged_df.groupby(["station_name", "isweekday"]).agg(
        {"duration": ["mean", "count"], "distance_from_city_center": "max"}
    ),
)
stationstats.columns = pd.Index(["duration", "num_trips", "distance_from_city_center"])
stationstats = stationstats.sort_values(
    by="distance_from_city_center", ascending=True
).reset_index()

# Expected output results:
# >>> stationstats.head(3)
#      station_name  isweekday  duration  num_trips  distance_from_city_center
#   Borough Road...    weekday      1110       5749                    0.12624
#   Borough Road...    weekend      2125       1774                    0.12624
#  Webber Street...    weekday       795       6517                   0.164021
# 3 rows × 5 columns

Create a k-means model
Create a k-means model using London Bicycle Hires training data.
SQL
In the following query, the CREATE MODEL statement specifies the number of clusters to use: four. In the SELECT statement, the EXCEPT clause excludes the station_name and isweekday columns because they don't contain features. The query creates a unique row per station_name, and only the features are mentioned in the SELECT statement.
Follow these steps to create a k-means model:
In the Google Cloud console, go to the BigQuery page.

In the query editor, paste in the following query and click Run:
CREATE OR REPLACE MODEL `bqml_tutorial.london_station_clusters`
  OPTIONS (model_type = 'kmeans', num_clusters = 4) AS
WITH
  hs AS (
    SELECT
      h.start_station_name AS station_name,
      IF(
        EXTRACT(DAYOFWEEK FROM h.start_date) = 1
          OR EXTRACT(DAYOFWEEK FROM h.start_date) = 7,
        'weekend',
        'weekday') AS isweekday,
      h.duration,
      ST_DISTANCE(
        ST_GEOGPOINT(s.longitude, s.latitude),
        ST_GEOGPOINT(-0.1, 51.5)) / 1000 AS distance_from_city_center
    FROM
      `bigquery-public-data.london_bicycles.cycle_hire` AS h
    JOIN
      `bigquery-public-data.london_bicycles.cycle_stations` AS s
      ON h.start_station_id = s.id
    WHERE
      h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
        AND CAST('2016-01-01 00:00:00' AS TIMESTAMP)
  ),
  stationstats AS (
    SELECT
      station_name,
      isweekday,
      AVG(duration) AS duration,
      COUNT(duration) AS num_trips,
      MAX(distance_from_city_center) AS distance_from_city_center
    FROM hs
    GROUP BY station_name, isweekday
  )
SELECT * EXCEPT (station_name, isweekday)
FROM stationstats;
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.
from bigframes.ml.cluster import KMeans

# To determine an optimal number of clusters, construct and fit several
# K-Means objects with different values of num_clusters, find the error
# measure, and pick the point at which the error measure is at its minimum
# value.
cluster_model = KMeans(n_clusters=4)
cluster_model.fit(stationstats)
cluster_model.to_gbq(
    your_model_id,  # For example: "bqml_tutorial.london_station_clusters"
    replace=True,
)

Interpret the data clusters
The information in the model's Evaluation tab can help you to interpret the clusters produced by the model.
Follow these steps to view the model's evaluation information:
In the Google Cloud console, go to the BigQuery page.

In the left pane, click Explorer:

In the Explorer pane, expand your project and click Datasets.

Click the bqml_tutorial dataset, and then go to the Models tab.

Select the london_station_clusters model.

Select the Evaluation tab. This tab displays visualizations of the clusters identified by the k-means model. In the Numeric features section, bar graphs display the most important numeric feature values for each centroid. Each centroid represents a given cluster of data. You can select which features to visualize from the drop-down menu.

This model creates the following centroids:
- Centroid 1 shows a less busy city station, with shorter duration rentals.
- Centroid 2 shows a second, less busy city station that is used for longer duration rentals.
- Centroid 3 shows a busy city station that is close to the city center.
- Centroid 4 shows a suburban station with trips that are longer.
If you were running the bicycle hire business, you could use this informationto inform business decisions. For example:
Assume that you need to experiment with a new type of lock. Which cluster of stations should you choose as a subject for this experiment? The stations in centroid 1, centroid 2, or centroid 4 seem like logical choices because they are not the busiest stations.
Assume that you want to stock some stations with racing bikes. Which stationsshould you choose? Centroid 4 is the group of stations that are far from thecity center, and they have the longest trips. These are likely candidates forracing bikes.
Use the ML.PREDICT function to predict a station's cluster

Identify the cluster to which a particular station belongs by using the ML.PREDICT SQL function or the predict BigQuery DataFrames function.
SQL
The following query uses the REGEXP_CONTAINS function to find all entries in the station_name column that contain the string Kennington. The ML.PREDICT function uses those values to predict which clusters might contain those stations.

Follow these steps to predict the cluster of every station that has the string Kennington in its name:
In the Google Cloud console, go to the BigQuery page.

In the query editor, paste in the following query and click Run:
WITH
  hs AS (
    SELECT
      h.start_station_name AS station_name,
      IF(
        EXTRACT(DAYOFWEEK FROM h.start_date) = 1
          OR EXTRACT(DAYOFWEEK FROM h.start_date) = 7,
        'weekend',
        'weekday') AS isweekday,
      h.duration,
      ST_DISTANCE(
        ST_GEOGPOINT(s.longitude, s.latitude),
        ST_GEOGPOINT(-0.1, 51.5)) / 1000 AS distance_from_city_center
    FROM
      `bigquery-public-data.london_bicycles.cycle_hire` AS h
    JOIN
      `bigquery-public-data.london_bicycles.cycle_stations` AS s
      ON h.start_station_id = s.id
    WHERE
      h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
        AND CAST('2016-01-01 00:00:00' AS TIMESTAMP)
  ),
  stationstats AS (
    SELECT
      station_name,
      isweekday,
      AVG(duration) AS duration,
      COUNT(duration) AS num_trips,
      MAX(distance_from_city_center) AS distance_from_city_center
    FROM hs
    GROUP BY station_name, isweekday
  )
SELECT * EXCEPT (nearest_centroids_distance)
FROM
  ML.PREDICT(
    MODEL `bqml_tutorial.london_station_clusters`,
    (
      SELECT *
      FROM stationstats
      WHERE REGEXP_CONTAINS(station_name, 'Kennington')
    ));
The results should look similar to the following.

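Conceptually, predicting with a k-means model is just the assignment step: each input row is mapped to the centroid nearest to it in feature space. The sketch below illustrates that idea with hypothetical centroid values; it ignores the feature standardization that BigQuery ML applies, so the raw numbers are for illustration only:

```python
import math


def predict_cluster(features, centroids):
    """Return the 1-based ID of the centroid closest to the feature vector."""
    distances = [math.dist(features, c) for c in centroids]
    return distances.index(min(distances)) + 1


# Hypothetical centroids over (duration, num_trips, distance_from_city_center).
centroids = [
    (1100.0, 5700.0, 0.1),  # busy, central, short rentals
    (2100.0, 1800.0, 0.1),  # quieter, longer rentals
    (800.0, 6500.0, 0.2),   # very busy, short rentals
    (1500.0, 900.0, 12.0),  # suburban, long trips
]

# A hypothetical suburban station: long average trips, few of them, far out.
station = (1600.0, 1000.0, 11.5)
print(predict_cluster(station, centroids))  # prints 4
```

ML.PREDICT does this for every row of the input query, also returning the distances to all centroids in the nearest_centroids_distance column, which the query above excludes for readability.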
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.
# Select model you'll use for predictions. `read_gbq_model` loads model
# data from BigQuery, but you could also use the `cluster_model` object
# from previous steps.
cluster_model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.london_station_clusters"
)

# Use 'contains' function to filter by stations containing the string
# "Kennington".
stationstats = stationstats.loc[
    stationstats["station_name"].str.contains("Kennington")
]

result = cluster_model.predict(stationstats)

# Expected output results:
# >>> result.peek(3)
#   CENTROID_ID  NEAREST...      station_name  isweekday  duration  num_trips  dist...
# 1             [{'CENTROID_ID'...  Borough...   weekday      1110       5749   0.13
# 2             [{'CENTROID_ID'...  Borough...   weekend      2125       1774   0.13
# 1             [{'CENTROID_ID'...  Webber...    weekday       795       6517   0.16
# 3 rows × 7 columns

Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

If necessary, open the BigQuery page in the Google Cloud console.

In the navigation, click the bqml_tutorial dataset you created.

Click Delete dataset on the right side of the window. This action deletes the dataset and the model.
In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.
Delete your project
To delete the project:
What's next
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- For information about creating models, see the CREATE MODEL syntax page.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.