Use ML and AI with BigQuery DataFrames
BigQuery DataFrames provides ML and AI capabilities through the bigframes.ml library.
You can preprocess data, create estimators to train models in BigQuery DataFrames, create ML pipelines, and split training and testing datasets.
Required roles
To get the permissions that you need to complete the tasks in this document, ask your administrator to grant you the following IAM roles on your project:
- Use remote models or AI functionalities: BigQuery Connection Admin (roles/bigquery.connectionAdmin)
- Use BigQuery DataFrames in a BigQuery notebook:
  - BigQuery User (roles/bigquery.user)
  - Notebook Runtime User (roles/aiplatform.notebookRuntimeUser)
  - Code Creator (roles/dataform.codeCreator)
- Use the default BigQuery connection:
  - BigQuery Data Editor (roles/bigquery.dataEditor)
  - BigQuery Connection Admin (roles/bigquery.connectionAdmin)
  - Cloud Functions Developer (roles/cloudfunctions.developer)
  - Service Account User (roles/iam.serviceAccountUser)
  - Storage Object Viewer (roles/storage.objectViewer)
- Use BigQuery DataFrames ML remote models: BigQuery Connection Admin (roles/bigquery.connectionAdmin)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
ML locations
The bigframes.ml library supports the same locations as BigQuery ML. BigQuery ML model prediction and other ML functions are supported in all BigQuery regions. Support for model training varies by region. For more information, see BigQuery ML locations.
Preprocess data
Create transformers to prepare data for use in estimators (models) by using the bigframes.ml.preprocessing module and the bigframes.ml.compose module. BigQuery DataFrames offers the following transformations:
- To bin continuous data into intervals, use the KBinsDiscretizer class in the bigframes.ml.preprocessing module.
- To normalize the target labels as integer values, use the LabelEncoder class in the bigframes.ml.preprocessing module.
- To scale each feature to the range [-1, 1] by its maximum absolute value, use the MaxAbsScaler class in the bigframes.ml.preprocessing module.
- To scale each feature to the range [0, 1], use the MinMaxScaler class in the bigframes.ml.preprocessing module.
- To standardize features by removing the mean and scaling to unit variance, use the StandardScaler class in the bigframes.ml.preprocessing module.
- To transform categorical values into numeric format, use the OneHotEncoder class in the bigframes.ml.preprocessing module.
- To apply transformers to DataFrame columns, use the ColumnTransformer class in the bigframes.ml.compose module, as shown in the sketch after this list.
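The following is a minimal sketch, not taken from the official samples, that combines StandardScaler and OneHotEncoder with a ColumnTransformer on the penguins sample table used elsewhere on this page. It assumes the sklearn-style fit and transform methods that bigframes.ml transformers expose; check the API reference for the exact signatures.

from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder, StandardScaler
import bigframes.pandas as bpd

# Load the sample data used in the other examples on this page
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Scale the numeric columns and one-hot encode a categorical column
preprocessor = ColumnTransformer([
    ("scale", StandardScaler(), ["culmen_length_mm", "flipper_length_mm"]),
    ("encode", OneHotEncoder(), ["species"]),
])

# Fit the transformers and apply them to the DataFrame columns
preprocessor.fit(bq_df)
transformed = preprocessor.transform(bq_df)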
Train models
You can create estimators to train models in BigQuery DataFrames.
Clustering models
You can create estimators for clustering models by using the bigframes.ml.cluster module. To create K-means clustering models, use the KMeans class. Use these models for data segmentation, such as identifying customer segments. K-means is an unsupervised learning technique, so model training doesn't require labels or split data for training or evaluation.
The following code sample shows using the bigframes.ml.cluster KMeans class to create a K-means clustering model for data segmentation:
from bigframes.ml.cluster import KMeans
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Create the KMeans model
cluster_model = KMeans(n_clusters=10)
cluster_model.fit(bq_df["culmen_length_mm"], bq_df["sex"])

# Predict using the model
result = cluster_model.predict(bq_df)

# Score the model
score = cluster_model.score(bq_df)

Decomposition models
You can create estimators for decomposition models by using the bigframes.ml.decomposition module. To create principal component analysis (PCA) models, use the PCA class. Use these models to compute principal components and use them to perform a change of basis on the data. The PCA class provides dimensionality reduction by projecting each data point onto only the first few principal components, producing lower-dimensional data while preserving as much of the data's variation as possible.
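As an illustration, the following sketch (an assumption based on the fit and predict pattern that the other bigframes.ml estimators on this page use, not an official sample) fits a PCA model to the numeric columns of the penguins table and projects the data onto the first three principal components:

from bigframes.ml.decomposition import PCA
import bigframes.pandas as bpd

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Use the numeric measurement columns as input features
features = bq_df[
    ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]
].dropna()

# Fit a PCA model that keeps the first three principal components
pca_model = PCA(n_components=3)
pca_model.fit(features)

# Project the data onto the principal components
components = pca_model.predict(features)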
Ensemble models
You can create estimators for ensemble models by using the bigframes.ml.ensemble module.
- To create random forest classifier models, use the RandomForestClassifier class. Use these models for constructing multiple learning method decision trees for classification.
- To create random forest regression models, use the RandomForestRegressor class. Use these models for constructing multiple learning method decision trees for regression.
- To create gradient boosted tree classifier models, use the XGBClassifier class. Use these models for additively constructing multiple learning method decision trees for classification.
- To create gradient boosted tree regression models, use the XGBRegressor class. Use these models for additively constructing multiple learning method decision trees for regression, as shown in the sketch after this list.
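The following sketch, which is not from the official samples and assumes the same fit, score, and predict pattern as the other estimators on this page, trains an XGBRegressor model to predict penguin body mass from the numeric measurement columns:

from bigframes.ml.ensemble import XGBRegressor
import bigframes.pandas as bpd

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Drop rows with nulls, then split into features and a label
training_data = bq_df.dropna()
X = training_data[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
y = training_data[["body_mass_g"]]

# Train a gradient boosted tree regression model
xgb_model = XGBRegressor()
xgb_model.fit(X, y)

# Evaluate the model and generate predictions
score = xgb_model.score(X, y)
predictions = xgb_model.predict(X)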
Forecasting models
You can create estimators for forecasting models by using the bigframes.ml.forecasting module. To create time series forecasting models, use the ARIMAPlus class, as shown in the sketch that follows.
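The following sketch is an assumption-based illustration rather than an official sample: it reads a hypothetical table (your-project.your_dataset.daily_sales with date and sales columns) and assumes that ARIMAPlus follows the fit and predict pattern of the other bigframes.ml estimators, with predict accepting a forecast horizon. Check the API reference before relying on the exact parameters.

from bigframes.ml.forecasting import ARIMAPlus
import bigframes.pandas as bpd

# Hypothetical table with one date column and one numeric column
df = bpd.read_gbq("your-project.your_dataset.daily_sales")

X = df[["date"]]   # time points (hypothetical column name)
y = df[["sales"]]  # values to forecast (hypothetical column name)

# Train the time series forecasting model
forecast_model = ARIMAPlus()
forecast_model.fit(X, y)

# Forecast the next 30 time points
forecast = forecast_model.predict(horizon=30)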
Imported models
You can create estimators for imported models by using the bigframes.ml.imported module.
- To import Open Neural Network Exchange (ONNX) models, use the ONNXModel class, as shown in the sketch after this list.
- To import TensorFlow models, use the TensorFlowModel class.
- To import XGBoost models, use the XGBoostModel class.
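As a hedged illustration, the following sketch imports an ONNX model from a hypothetical Cloud Storage path and runs prediction on a hypothetical input table. It assumes that ONNXModel accepts a session and a model_path argument; the bucket, path, and table names are placeholders, so check the API reference for the exact constructor arguments.

from bigframes.ml.imported import ONNXModel
import bigframes.pandas as bpd

session = bpd.get_global_session()

# Hypothetical Cloud Storage path to a previously exported ONNX model
imported_model = ONNXModel(
    session=session,
    model_path="gs://your-bucket/models/model.onnx",
)

# Run prediction with the imported model on a hypothetical input table
input_df = bpd.read_gbq("your-project.your_dataset.prediction_inputs")
predictions = imported_model.predict(input_df)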
Linear models
Create estimators for linear models by using the bigframes.ml.linear_model module.
- To create linear regression models, use the LinearRegression class. Use these models for forecasting, such as forecasting the sales of an item on a given day.
- To create logistic regression models, use the LogisticRegression class. Use these models for the classification of two or more possible values, such as whether an input is low-value, medium-value, or high-value.
The following code sample shows using bigframes.ml to do the following:
- Load data from BigQuery.
- Clean and prepare training data.
- Create and apply a bigframes.ml.LinearRegression regression model.
from bigframes.ml.linear_model import LinearRegression
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Filter down the data to the Adelie Penguin species
adelie_data = bq_df[bq_df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# Drop the species column
adelie_data = adelie_data.drop(columns=["species"])

# Drop rows with nulls to get training data
training_data = adelie_data.dropna()

# Specify your feature (or input) columns and the label (or output) column
feature_columns = training_data[
    ["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "sex"]
]
label_columns = training_data[["body_mass_g"]]

test_data = adelie_data[adelie_data.body_mass_g.isnull()]

# Create the linear model
model = LinearRegression()
model.fit(feature_columns, label_columns)

# Score the model
score = model.score(feature_columns, label_columns)

# Predict using the model
result = model.predict(test_data)

Large language models
You can create estimators for LLMs by using the bigframes.ml.llm module.
To create Gemini text generator models, use the GeminiTextGenerator class. Use these models for text generation tasks.
The following code sample shows using the bigframes.ml.llm GeminiTextGenerator class to create a Gemini model for code generation:
from bigframes.ml.llm import GeminiTextGenerator
import bigframes.pandas as bpd

# Create the Gemini LLM model
session = bpd.get_global_session()
connection = f"{PROJECT_ID}.{REGION}.{CONN_NAME}"
model = GeminiTextGenerator(
    session=session, connection_name=connection, model_name="gemini-2.0-flash-001"
)

df_api = bpd.read_csv("gs://cloud-samples-data/vertex-ai/bigframe/df.csv")

# Prepare the prompts and send them to the LLM model for prediction
df_prompt_prefix = "Generate Pandas sample code for DataFrame."
df_prompt = df_prompt_prefix + df_api["API"]

# Predict using the model
df_pred = model.predict(df_prompt.to_frame(), max_output_tokens=1024)

Remote models
To use BigQuery DataFrames ML remote models (bigframes.ml.remote or bigframes.ml.llm), you must enable the required APIs, including the BigQuery Connection API and the Vertex AI API.
When you use BigQuery DataFrames ML remote models, you need the Project IAM Admin role (roles/resourcemanager.projectIamAdmin) if you use a default BigQuery connection, or the Browser role (roles/browser) if you use a pre-configured connection. You can avoid this requirement by setting the bigframes.pandas.options.bigquery.skip_bq_connection_check option to True (see the snippet after this list), in which case the connection (default or pre-configured) is used as-is without any existence or permission check. If you use the pre-configured connection and skip the connection check, verify the following:
- The connection is created in the right location.
- If you use BigQuery DataFrames ML remote models, the service account has the Vertex AI User role (roles/aiplatform.user) on the project.
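The following snippet shows how you might set the option described above. The skip_bq_connection_check option name comes from this page; the connection ID shown is a hypothetical placeholder, and the bq_connection option used to point at a pre-configured connection is an assumption to verify against the API reference. Set these options before you run other BigQuery DataFrames operations, because they apply when the session is created.

import bigframes.pandas as bpd

# Use the connection as-is, without any existence or permission check
bpd.options.bigquery.skip_bq_connection_check = True

# Optionally point BigQuery DataFrames at a pre-configured connection
# (hypothetical connection ID shown)
bpd.options.bigquery.bq_connection = "your-project.us.your-connection"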
Creating a remote model in BigQuery DataFrames creates a BigQuery connection. By default, a connection named bigframes-default-connection is used. You can use a pre-configured BigQuery connection if you prefer, in which case the connection creation is skipped. The service account for the default connection is granted the Vertex AI User role (roles/aiplatform.user) on the project.
Create pipelines
You can create ML pipelines by using the bigframes.ml.pipeline module. Pipelines let you assemble several ML steps to be cross-validated together while setting different parameters. This simplifies your code, and lets you deploy data preprocessing steps and an estimator together.
To create a pipeline of transforms with a final estimator, use the Pipeline class.
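The following is a minimal sketch, not an official sample, that chains the preprocessing transformers described earlier with a LinearRegression estimator. It assumes the sklearn-style Pipeline(steps) constructor; the column choices are illustrative.

from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.preprocessing import OneHotEncoder, StandardScaler
import bigframes.pandas as bpd

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
training_data = bq_df.dropna()

X = training_data[["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
y = training_data[["body_mass_g"]]

# Chain preprocessing and a final estimator into a single pipeline
pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("scale", StandardScaler(),
         ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]),
        ("encode", OneHotEncoder(), ["island"]),
    ])),
    ("model", LinearRegression()),
])

# Fitting the pipeline fits the transformers and the estimator together
pipeline.fit(X, y)
predictions = pipeline.predict(X)

Because the preprocessing steps and the estimator are fitted together, the whole pipeline can be scored, used for prediction, and deployed as one unit.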
Select models
To split your training and testing datasets and select the best models, use the bigframes.ml.model_selection module:

- To split the data into training and testing (evaluation) sets, use the train_test_split function, as shown in the following code sample:

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

- To create multi-fold training and testing sets to train and evaluate models, use the KFold class and the KFold.split method, as shown in the following code sample. This feature is valuable for small datasets.

  kf = KFold(n_splits=5)
  for i, (X_train, X_test, y_train, y_test) in enumerate(kf.split(X, y)):
      # Train and evaluate models with training and testing sets

- To automatically create multi-fold training and testing sets, train and evaluate the model, and get the result of each fold, use the cross_validate function, as shown in the following code sample:

  scores = cross_validate(model, X, y, cv=5)
What's next
- Learn about the BigQuery DataFrames data type system.
- Learn how to generate BigQuery DataFrames code with Gemini.
- Learn how to analyze package downloads from PyPI with BigQuery DataFrames.
- View BigQuery DataFrames source code, sample notebooks, and samples on GitHub.
- Explore the BigQuery DataFrames API reference.