Create recommendations based on explicit feedback with a matrix factorization model

This tutorial teaches you how to create a matrix factorization model and train it on the customer movie ratings in the movielens1m dataset. You then use the matrix factorization model to generate movie recommendations for users.

Using customer-provided ratings to train the model is called training with explicit feedback. Matrix factorization models are trained using the Alternating Least Squares algorithm when you use explicit feedback as training data.

Important: You must have a reservation in order to use a matrix factorization model. For more information, see Pricing.

Objectives

This tutorial guides you through completing the following tasks:

  • Creating a matrix factorization model by using the CREATE MODEL statement.
  • Evaluating the model by using the ML.EVALUATE function.
  • Generating movie recommendations for users by using the model with the ML.RECOMMEND function.

Costs

This tutorial uses billable components of Google Cloud, including the following:

  • BigQuery
  • BigQuery ML

For more information on BigQuery costs, see the BigQuery pricing page.

For more information on BigQuery ML costs, see BigQuery ML pricing.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

  3. Verify that billing is enabled for your Google Cloud project.

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, enable the BigQuery API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Required Permissions

  • To create the dataset, you need the bigquery.datasets.create IAM permission.

  • To create the model, you need the following permissions:

    • bigquery.jobs.create
    • bigquery.models.create
    • bigquery.models.getData
    • bigquery.models.updateData
  • To run inference, you need the following permissions:

    • bigquery.models.getData
    • bigquery.jobs.create

For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.

Create a dataset

Create a BigQuery dataset to store your ML model.

Console

  1. In the Google Cloud console, go to the BigQuery page.

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter bqml_tutorial.

    • For Location type, select Multi-region, and then select US (multiple regions in United States).

    • Leave the remaining default settings as they are, and click Create dataset.

bq

To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.

  1. Create a dataset named bqml_tutorial with the data location set to US and a description of BigQuery ML tutorial dataset:

    bq --location=US mk -d \
      --description "BigQuery ML tutorial dataset." \
      bqml_tutorial

    Instead of using the --dataset flag, the command uses the -d shortcut. If you omit -d and --dataset, the command defaults to creating a dataset.

  2. Confirm that the dataset was created:

    bq ls

API

Call the datasets.insert method with a defined dataset resource.

    {
      "datasetReference": {
        "datasetId": "bqml_tutorial"
      }
    }

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

    import google.cloud.bigquery

    bqclient = google.cloud.bigquery.Client()
    bqclient.create_dataset("bqml_tutorial", exists_ok=True)

Upload the Movielens data

Upload the movielens1m data into BigQuery.

CLI

Follow these steps to upload the movielens1m data using the bq command-line tool:

  1. Open Cloud Shell.

  2. Upload the ratings data into the ratings table. On the command line, paste the following commands and press Enter:

    curl -O 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
    unzip ml-1m.zip
    sed 's/::/,/g' ml-1m/ratings.dat > ratings.csv
    bq load --source_format=CSV bqml_tutorial.ratings ratings.csv \
      user_id:INT64,item_id:INT64,rating:FLOAT64,timestamp:TIMESTAMP

  3. Upload the movie data into the movies table. On the command line, paste the following commands and press Enter:

    sed 's/::/@/g' ml-1m/movies.dat > movie_titles.csv
    bq load --source_format=CSV --field_delimiter=@ \
      bqml_tutorial.movies movie_titles.csv \
      movie_id:INT64,movie_title:STRING,genre:STRING
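The ratings file is converted to comma-delimited text, while the movies file is converted with an @ delimiter. A likely reason, which this tutorial does not state, is that movie titles can contain commas (for example, "Kid, The (1921)" appears in the recommendation results later on), so a comma-delimited movies file would split such titles into extra columns. The following Python sketch illustrates the difference on one sample line in the MovieID::Title::Genres layout. The movie ID in the sample line is a placeholder, not a value taken from the dataset, and the parse helper is a hypothetical name used only here.

```python
import csv
import io

# Sample line in the movies.dat layout (MovieID::Title::Genres).
# The ID 1 is a placeholder chosen for illustration only.
line = "1::Kid, The (1921)::Action"


def parse(converted_line, delimiter):
    """Parse one delimited line the way a simple CSV reader would."""
    reader = csv.reader(io.StringIO(converted_line), delimiter=delimiter)
    return next(reader)


# Replacing "::" with "," also splits the title at its own comma.
comma_fields = parse(line.replace("::", ","), ",")

# Replacing "::" with "@" keeps the title in a single field.
at_fields = parse(line.replace("::", "@"), "@")

print(len(comma_fields), comma_fields)
print(len(at_fields), at_fields)
```

With the comma delimiter the line yields four fields instead of the three that the movies schema expects, whereas the @ delimiter yields exactly three.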

BigQuery DataFrames

First, create a Client object with bqclient = google.cloud.bigquery.Client(), then load the movielens1m data into the dataset you created in the previous step.

    import io
    import zipfile

    import google.api_core.exceptions
    import requests

    try:
        # Check if you've already created the Movielens tables to avoid downloading
        # and uploading the dataset unnecessarily.
        bqclient.get_table("bqml_tutorial.ratings")
        bqclient.get_table("bqml_tutorial.movies")
    except google.api_core.exceptions.NotFound:
        # Download the https://grouplens.org/datasets/movielens/1m/ dataset.
        ml1m = requests.get("http://files.grouplens.org/datasets/movielens/ml-1m.zip")
        ml1m_file = io.BytesIO(ml1m.content)
        ml1m_zip = zipfile.ZipFile(ml1m_file)

        # Upload the ratings data into the ratings table.
        with ml1m_zip.open("ml-1m/ratings.dat") as ratings_file:
            ratings_content = ratings_file.read()

        ratings_csv = io.BytesIO(ratings_content.replace(b"::", b","))
        ratings_config = google.cloud.bigquery.LoadJobConfig()
        ratings_config.source_format = "CSV"
        ratings_config.write_disposition = "WRITE_TRUNCATE"
        ratings_config.schema = [
            google.cloud.bigquery.SchemaField("user_id", "INT64"),
            google.cloud.bigquery.SchemaField("item_id", "INT64"),
            google.cloud.bigquery.SchemaField("rating", "FLOAT64"),
            google.cloud.bigquery.SchemaField("timestamp", "TIMESTAMP"),
        ]
        bqclient.load_table_from_file(
            ratings_csv, "bqml_tutorial.ratings", job_config=ratings_config
        ).result()

        # Upload the movie data into the movies table.
        with ml1m_zip.open("ml-1m/movies.dat") as movies_file:
            movies_content = movies_file.read()

        movies_csv = io.BytesIO(movies_content.replace(b"::", b"@"))
        movies_config = google.cloud.bigquery.LoadJobConfig()
        movies_config.source_format = "CSV"
        movies_config.field_delimiter = "@"
        movies_config.write_disposition = "WRITE_TRUNCATE"
        movies_config.schema = [
            google.cloud.bigquery.SchemaField("movie_id", "INT64"),
            google.cloud.bigquery.SchemaField("movie_title", "STRING"),
            google.cloud.bigquery.SchemaField("genre", "STRING"),
        ]
        bqclient.load_table_from_file(
            movies_csv, "bqml_tutorial.movies", job_config=movies_config
        ).result()

Create the model

Create a matrix factorization model and train it on the data in the ratings table. The model is trained to predict a rating for every user-item pair, based on the customer-provided movie ratings.

SQL

The following CREATE MODEL statement uses these columns to generate recommendations:

  • user_id: The user ID.
  • item_id: The movie ID.
  • rating: The explicit rating from 1 to 5 that the user gave the item.

Follow these steps to create the model:

  1. In the Google Cloud console, go to the BigQuery page.

  2. In the query editor, paste in the following query and click Run:

    CREATE OR REPLACE MODEL `bqml_tutorial.mf_explicit`
      OPTIONS (
        MODEL_TYPE = 'matrix_factorization',
        FEEDBACK_TYPE = 'explicit',
        USER_COL = 'user_id',
        ITEM_COL = 'item_id',
        L2_REG = 9.83,
        NUM_FACTORS = 34)
    AS
    SELECT
      user_id,
      item_id,
      rating
    FROM
      `bqml_tutorial.ratings`;

    The query takes about 10 minutes to complete, after which the mf_explicit model appears in the Explorer pane. Because the query uses a CREATE MODEL statement to create a model, you don't see query results.

BigQuery DataFrames

    from bigframes.ml import decomposition
    import bigframes.pandas as bpd

    # Load data from BigQuery
    bq_df = bpd.read_gbq(
        "bqml_tutorial.ratings", columns=("user_id", "item_id", "rating")
    )

    # Create the Matrix Factorization model
    model = decomposition.MatrixFactorization(
        num_factors=34,
        feedback_type="explicit",
        user_col="user_id",
        item_col="item_id",
        rating_col="rating",
        l2_reg=9.83,
    )
    model.fit(bq_df)
    model.to_gbq(
        your_model_id, replace=True  # For example: "bqml_tutorial.mf_explicit"
    )

The code takes about 10 minutes to complete, after which the mf_explicit model appears in the Explorer pane.

Get training statistics

Optionally, you can view the model's training statistics in the Google Cloud console.

A machine learning algorithm builds a model by creating many iterations of the model using different parameters, and then selecting the version of the model that minimizes loss. This process is called empirical risk minimization. The model's training statistics let you see the loss associated with each iteration of the model.

Follow these steps to view the model's training statistics:

  1. In the Google Cloud console, go to the BigQuery page.

  2. In the left pane, click Explorer.

    If you don't see the left pane, click Expand left pane to open the pane.

  3. In the Explorer pane, expand your project, click Datasets, and then click the bqml_tutorial dataset.

  4. Click the Models tab.

  5. Click the mf_explicit model and then click the Training tab.

  6. In the View as section, click Table. The results should look similar to the following:

    +-----------+--------------------+--------------------+
    | Iteration | Training Data Loss | Duration (seconds) |
    +-----------+--------------------+--------------------+
    |  11       | 0.3943             | 42.59              |
    +-----------+--------------------+--------------------+
    |  10       | 0.3979             | 27.37              |
    +-----------+--------------------+--------------------+
    |   9       | 0.4038             | 40.79              |
    +-----------+--------------------+--------------------+
    |  ...      | ...                | ...                |
    +-----------+--------------------+--------------------+

    The Training Data Loss column represents the loss metric calculated after the model is trained. Because this is a matrix factorization model, this column shows the mean squared error.

You can also use the ML.TRAINING_INFO function to see model training statistics.
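To make the per-iteration loss more concrete, the following is a minimal pure-Python sketch of alternating least squares on a handful of made-up ratings. It is not BigQuery ML's implementation: the toy data, the helper names (solve, update, regularized_loss), the rank of 2, and the L2 value of 0.1 are all assumptions chosen for illustration. Each pass holds the item factors fixed and solves a small ridge regression for every user, then does the reverse for every item, and prints the mean squared error on the training ratings.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b with Gaussian elimination."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def update(observed, fixed, k, l2):
    """Recompute one side's factors while the other side stays fixed."""
    new = {}
    for key, pairs in observed.items():
        # Normal equations of ridge regression: (F^T F + l2 * I) x = F^T r
        a = [[l2 if i == j else 0.0 for j in range(k)] for i in range(k)]
        b = [0.0] * k
        for other, rating in pairs:
            f = fixed[other]
            for i in range(k):
                b[i] += rating * f[i]
                for j in range(k):
                    a[i][j] += f[i] * f[j]
        new[key] = solve(a, b)
    return new


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def regularized_loss(ratings, users, items, l2):
    """Squared error plus the L2 penalty on all factor vectors."""
    err = sum((r - dot(users[u], items[i])) ** 2 for u, i, r in ratings)
    norm = sum(dot(f, f) for f in list(users.values()) + list(items.values()))
    return err + l2 * norm


# Toy (user_id, item_id, rating) triples, made up for illustration.
ratings = [(1, 10, 5.0), (1, 11, 3.0), (2, 10, 4.0), (2, 12, 1.0),
           (3, 11, 2.0), (3, 12, 5.0), (4, 10, 5.0), (4, 12, 2.0)]
k, l2 = 2, 0.1  # assumed values, far smaller than the tutorial's settings

by_user, by_item = {}, {}
for u, i, r in ratings:
    by_user.setdefault(u, []).append((i, r))
    by_item.setdefault(i, []).append((u, r))

rng = random.Random(0)
users = {u: [rng.random() for _ in range(k)] for u in by_user}
items = {i: [rng.random() for _ in range(k)] for i in by_item}

losses = [regularized_loss(ratings, users, items, l2)]
for iteration in range(1, 11):
    users = update(by_user, items, k, l2)  # items fixed, solve for users
    items = update(by_item, users, k, l2)  # users fixed, solve for items
    losses.append(regularized_loss(ratings, users, items, l2))
    mse = sum((r - dot(users[u], items[i])) ** 2
              for u, i, r in ratings) / len(ratings)
    print(f"iteration {iteration}: mean squared error {mse:.4f}")
```

Because each half-step exactly minimizes the regularized objective for one side while the other is held fixed, that objective never increases from one iteration to the next, which is the same shrinking-loss pattern that the training table shows.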

Evaluate the model

Evaluate the performance of the model by comparing the predicted movie ratings returned by the model against the actual user movie ratings from the training data.

SQL

Use the ML.EVALUATE function to evaluate the model:

  1. In the Google Cloud console, go to the BigQuery page.

  2. In the query editor, paste in the following query and click Run:

    SELECT
      *
    FROM
      ML.EVALUATE(
        MODEL `bqml_tutorial.mf_explicit`,
        (
          SELECT
            user_id,
            item_id,
            rating
          FROM
            `bqml_tutorial.ratings`
        ));

    The results should look similar to the following:

    +---------------------+---------------------+------------------------+-----------------------+--------------------+--------------------+
    | mean_absolute_error | mean_squared_error  | mean_squared_log_error | median_absolute_error |      r2_score      | explained_variance |
    +---------------------+---------------------+------------------------+-----------------------+--------------------+--------------------+
    | 0.48494444327829156 | 0.39433706592870565 |   0.025437895793637522 |   0.39017059802629905 | 0.6840033369412044 | 0.6840033369412264 |
    +---------------------+---------------------+------------------------+-----------------------+--------------------+--------------------+

    An important metric in the evaluation results is the R2 score. The R2 score is a statistical measure that determines if the linear regression predictions approximate the actual data. A value of 0 indicates that the model explains none of the variability of the response data around the mean. A value of 1 indicates that the model explains all the variability of the response data around the mean.

    For more information about the ML.EVALUATE function output, see Output.

You can also call ML.EVALUATE without providing the input data. It will use the evaluation metrics calculated during training.
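As a rough illustration of what three of these metrics measure, the following Python sketch computes the mean absolute error, the mean squared error, and the R2 score from paired actual and predicted ratings using their standard definitions. The regression_metrics helper and the sample ratings are assumptions for illustration; this is not how ML.EVALUATE is implemented, and the sketch assumes the actual ratings are not all identical (otherwise R2 is undefined).

```python
def regression_metrics(actual, predicted):
    """Return MAE, MSE, and R2 for paired actual and predicted ratings."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    # Residual sum of squares and total sum of squares around the mean.
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "mean_squared_error": ss_res / n,
        "r2_score": 1.0 - ss_res / ss_tot,
    }


# Made-up ratings for illustration.
actual = [4.0, 5.0, 2.0, 5.0, 4.0]
mean_rating = sum(actual) / len(actual)

# A model that reproduces every rating explains all of the variability.
perfect = regression_metrics(actual, actual)

# A model that always predicts the mean rating explains none of it.
baseline = regression_metrics(actual, [mean_rating] * len(actual))

print(perfect["r2_score"], baseline["r2_score"])
```

The two cases correspond to the R2 values of 1 and 0 described above: perfect predictions leave no residual error, while always predicting the mean leaves a residual equal to the total variability.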

BigQuery DataFrames

Call model.score() to evaluate the model.

    # Evaluate the model using the score() function
    model.score(bq_df)
    # Output:
    # mean_absolute_error  mean_squared_error  mean_squared_log_error  median_absolute_error  r2_score  explained_variance
    # 0.485403             0.395052            0.025515                0.390573               0.68343   0.68343

Get the predicted ratings for a subset of user-item pairs

Get the predicted rating for each movie for five users.

SQL

Use the ML.RECOMMEND function to get predicted ratings:

  1. In the Google Cloud console, go to the BigQuery page.

  2. In the query editor, paste in the following query and click Run:

    SELECT
      *
    FROM
      ML.RECOMMEND(
        MODEL `bqml_tutorial.mf_explicit`,
        (
          SELECT
            user_id
          FROM
            `bqml_tutorial.ratings`
          LIMIT 5
        ));

    The results should look similar to the following:

    +--------------------+---------+---------+
    | predicted_rating   | user_id | item_id |
    +--------------------+---------+---------+
    | 4.2125303962491873 | 4       | 3169    |
    +--------------------+---------+---------+
    | 4.8068920531981263 | 4       | 3739    |
    +--------------------+---------+---------+
    | 3.8742203494732403 | 4       | 3574    |
    +--------------------+---------+---------+
    | ...                | ...     | ...     |
    +--------------------+---------+---------+

BigQuery DataFrames

Call model.predict() to get predicted ratings.

    # Use predict() to get the predicted rating for each movie for 5 users
    subset = bq_df[["user_id"]].head(5)
    predicted = model.predict(subset)
    print(predicted)
    # Output:
    #    predicted_rating  user_id  item_id  rating
    # 0          4.206146     4354      968     4.0
    # 1          4.853099     3622     3521     5.0
    # 2          2.679067     5543      920     2.0
    # 3          4.323458      445     3175     5.0
    # 4          3.476911     5535      235     4.0

Generate recommendations

Use the predicted ratings to generate the top five recommended movies for each user.

SQL

Follow these steps to generate recommendations:

  1. In the Google Cloud console, go to the BigQuery page.

  2. Write the predicted ratings to a table. In the query editor, paste in the following query and click Run:

    CREATE OR REPLACE TABLE `bqml_tutorial.recommend`
    AS
    SELECT
      *
    FROM
      ML.RECOMMEND(MODEL `bqml_tutorial.mf_explicit`);

  3. Join the predicted ratings with the movie information, and select the top five results per user. In the query editor, paste in the following query and click Run:

    SELECT
      user_id,
      ARRAY_AGG(STRUCT(movie_title, genre, predicted_rating)
        ORDER BY predicted_rating DESC LIMIT 5)
    FROM
      (
        SELECT
          user_id,
          item_id,
          predicted_rating,
          movie_title,
          genre
        FROM
          `bqml_tutorial.recommend`
        JOIN
          `bqml_tutorial.movies`
          ON item_id = movie_id
      )
    GROUP BY
      user_id;

    The results should look similar to the following:

    +---------+-------------------------------------+------------------------+--------------------+
    | user_id | f0_movie_title                      | f0_genre               | predicted_rating   |
    +---------+-------------------------------------+------------------------+--------------------+
    | 4597    | Song of Freedom (1936)              | Drama                  | 6.8495752907364009 |
    |         | I Went Down (1997)                  | Action/Comedy/Crime    | 6.7203235758772877 |
    |         | Men With Guns (1997)                | Action/Drama           | 6.399407352232001  |
    |         | Kid, The (1921)                     | Action                 | 6.1952890198126731 |
    |         | Hype! (1996)                        | Documentary            | 6.1895766097451475 |
    +---------+-------------------------------------+------------------------+--------------------+
    | 5349    | Fandango (1985)                     | Comedy                 | 9.944574012151549  |
    |         | Breakfast of Champions (1999)       | Comedy                 | 9.55661860430112   |
    |         | Funny Bones (1995)                  | Comedy                 | 9.52778917835076   |
    |         | Paradise Road (1997)                | Drama/War              | 9.1643621767929133 |
    |         | Surviving Picasso (1996)            | Drama                  | 8.807353289233772  |
    +---------+-------------------------------------+------------------------+--------------------+
    | ...     | ...                                 | ...                    | ...                |
    +---------+-------------------------------------+------------------------+--------------------+
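The grouping query keeps, for each user, the five rows with the highest predicted rating. The same idea can be sketched in plain Python as a group-then-select step. The top_n_per_user helper and the sample rows below are made up for illustration and are not part of BigQuery or of this tutorial's dataset.

```python
import heapq
from collections import defaultdict


def top_n_per_user(rows, n=5):
    """Group (user_id, title, predicted_rating) rows and keep the n best per user."""
    by_user = defaultdict(list)
    for user_id, title, predicted_rating in rows:
        by_user[user_id].append((title, predicted_rating))
    # nlargest returns each user's movies sorted by predicted rating, descending.
    return {
        user_id: heapq.nlargest(n, movies, key=lambda movie: movie[1])
        for user_id, movies in by_user.items()
    }


# Made-up rows for illustration only.
rows = [
    (1, "Movie A", 4.1), (1, "Movie B", 4.9), (1, "Movie C", 3.2),
    (2, "Movie A", 2.5), (2, "Movie C", 4.4),
]
top = top_n_per_user(rows, n=2)
print(top)
```

Note that the sample results above include predicted ratings greater than 5, so the predictions do not appear to be limited to the 1 to 5 rating scale. Ranking by predicted rating works the same way regardless of the scale.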

BigQuery DataFrames

Join the predicted ratings returned by model.predict() with the movie information, and keep the top five results per user.

    # Load movies
    movies = bpd.read_gbq("bqml_tutorial.movies")

    # Merge the movies df with the previously created predicted df
    merged_df = bpd.merge(predicted, movies, left_on="item_id", right_on="movie_id")

    # Separate users and predicted data, setting the index to 'movie_id'
    users = merged_df[["user_id", "movie_id"]].set_index("movie_id")

    # Take the predicted data and sort it in descending order by
    # 'predicted_rating', setting the index to 'movie_id'
    sort_data = (
        merged_df[["movie_title", "genre", "predicted_rating", "movie_id"]]
        .sort_values(by="predicted_rating", ascending=False)
        .set_index("movie_id")
    )

    # Re-merge the separated dfs by index
    merged_user = sort_data.join(users, how="outer")

    # Group by user, keep five rows per user, and set user_id as the index.
    # The result is assigned so that the print call below shows it.
    merged_user = (
        merged_user.groupby("user_id").head(5).set_index("user_id").sort_index()
    )

    print(merged_user)
    # Output:
    #                         movie_title                   genre  predicted_rating
    # user_id
    # 1        Saving Private Ryan (1998)        Action|Drama|War           5.19326
    # 1                      Fargo (1996)    Crime|Drama|Thriller          4.996954
    # 1         Driving Miss Daisy (1989)                   Drama          4.983671
    # 1                    Ben-Hur (1959)  Action|Adventure|Drama          4.877622
    # 1           Schindler's List (1993)               Drama|War          4.802336
    # 2        Saving Private Ryan (1998)        Action|Drama|War           5.19326
    # 2                 Braveheart (1995)        Action|Drama|War          5.174145
    # 2                  Gladiator (2000)            Action|Drama          5.066372
    # 2             On Golden Pond (1981)                   Drama           5.01198
    # 2         Driving Miss Daisy (1989)                   Drama          4.983671

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Google Cloud console.

  2. In the navigation, click the bqml_tutorial dataset you created.

  3. Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.

  4. In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Delete your project

To delete the project:

    Caution: Deleting a project has the following effects:
    • Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
    • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

    If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Google Cloud console, go to the Manage resources page.

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies.

Last updated 2025-12-15 UTC.