Use ML and AI with BigQuery DataFrames
BigQuery DataFrames provides ML and AI capabilities through the bigframes.ml library.
You can preprocess data, create estimators to train models in BigQuery DataFrames, create ML pipelines, and split training and testing datasets.
Required roles
To get the permissions that you need to complete the tasks in this document, ask your administrator to grant you the following IAM roles on your project:
- Use remote models or AI functionalities: BigQuery Connection Admin (roles/bigquery.connectionAdmin)
- Use BigQuery DataFrames in a BigQuery notebook:
  - BigQuery User (roles/bigquery.user)
  - Notebook Runtime User (roles/aiplatform.notebookRuntimeUser)
  - Code Creator (roles/dataform.codeCreator)
- Use the default BigQuery connection:
  - BigQuery Data Editor (roles/bigquery.dataEditor)
  - BigQuery Connection Admin (roles/bigquery.connectionAdmin)
  - Cloud Functions Developer (roles/cloudfunctions.developer)
  - Service Account User (roles/iam.serviceAccountUser)
  - Storage Object Viewer (roles/storage.objectViewer)
- Use BigQuery DataFrames ML remote models: BigQuery Connection Admin (roles/bigquery.connectionAdmin)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
ML locations
The bigframes.ml library supports the same locations as BigQuery ML. BigQuery ML model prediction and other ML functions are supported in all BigQuery regions. Support for model training varies by region. For more information, see BigQuery ML locations.
Preprocess data
Create transformers to prepare data for use in estimators (models) by using the bigframes.ml.preprocessing module and the bigframes.ml.compose module. BigQuery DataFrames offers the following transformations:
- To bin continuous data into intervals, use the KBinsDiscretizer class in the bigframes.ml.preprocessing module.
- To normalize the target labels as integer values, use the LabelEncoder class in the bigframes.ml.preprocessing module.
- To scale each feature to the range [-1, 1] by its maximum absolute value, use the MaxAbsScaler class in the bigframes.ml.preprocessing module.
- To scale each feature to the range [0, 1], use the MinMaxScaler class in the bigframes.ml.preprocessing module.
- To standardize features by removing the mean and scaling to unit variance, use the StandardScaler class in the bigframes.ml.preprocessing module.
- To transform categorical values into numeric format, use the OneHotEncoder class in the bigframes.ml.preprocessing module.
- To apply transformers to DataFrame columns, use the ColumnTransformer class in the bigframes.ml.compose module, as shown in the sketch after this list.
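The following is a minimal sketch, not taken from the official samples, that combines StandardScaler and OneHotEncoder with a ColumnTransformer on the penguins sample table used elsewhere on this page. It assumes the sklearn-style fit and transform methods that bigframes.ml transformers expose; check the API reference for the exact signatures.

from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder, StandardScaler
import bigframes.pandas as bpd

# Load the sample data used in the other examples on this page
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Scale the numeric columns and one-hot encode a categorical column
preprocessor = ColumnTransformer([
    ("scale", StandardScaler(), ["culmen_length_mm", "flipper_length_mm"]),
    ("encode", OneHotEncoder(), ["species"]),
])

# Fit the transformers and apply them to the DataFrame columns
preprocessor.fit(bq_df)
transformed = preprocessor.transform(bq_df)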
Train models
You can create estimators to train models in BigQuery DataFrames.
Clustering models
You can create estimators for clustering models by using the bigframes.ml.cluster module. To create K-means clustering models, use the KMeans class. Use these models for data segmentation, such as identifying customer segments. K-means is an unsupervised learning technique, so model training doesn't require labels or split data for training or evaluation.
The following code sample shows using the bigframes.ml.cluster KMeans class to create a K-means clustering model for data segmentation:
from bigframes.ml.cluster import KMeans
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Create the KMeans model
cluster_model = KMeans(n_clusters=10)
cluster_model.fit(bq_df["culmen_length_mm"], bq_df["sex"])

# Predict using the model
result = cluster_model.predict(bq_df)

# Score the model
score = cluster_model.score(bq_df)

Decomposition models
You can create estimators for decomposition models by using the bigframes.ml.decomposition module. To create principal component analysis (PCA) models, use the PCA class. Use these models to compute principal components and use them to perform a change of basis on the data. The PCA class provides dimensionality reduction by projecting each data point onto only the first few principal components, producing lower-dimensional data while preserving as much of the data's variation as possible.
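As an illustration, the following sketch (an assumption based on the fit and predict pattern that the other bigframes.ml estimators on this page use, not an official sample) fits a PCA model to the numeric columns of the penguins table and projects the data onto the first three principal components:

from bigframes.ml.decomposition import PCA
import bigframes.pandas as bpd

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Use the numeric measurement columns as input features
features = bq_df[
    ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]
].dropna()

# Fit a PCA model that keeps the first three principal components
pca_model = PCA(n_components=3)
pca_model.fit(features)

# Project the data onto the principal components
components = pca_model.predict(features)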
Ensemble models
You can create estimators for ensemble models by using the bigframes.ml.ensemble module.
- To create random forest classifier models, use the RandomForestClassifier class. Use these models for constructing multiple learning method decision trees for classification.
- To create random forest regression models, use the RandomForestRegressor class. Use these models for constructing multiple learning method decision trees for regression.
- To create gradient boosted tree classifier models, use the XGBClassifier class. Use these models for additively constructing multiple learning method decision trees for classification.
- To create gradient boosted tree regression models, use the XGBRegressor class. Use these models for additively constructing multiple learning method decision trees for regression, as shown in the sketch after this list.
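The following sketch, which is not from the official samples and assumes the same fit, score, and predict pattern as the other estimators on this page, trains an XGBRegressor model to predict penguin body mass from the numeric measurement columns:

from bigframes.ml.ensemble import XGBRegressor
import bigframes.pandas as bpd

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Drop rows with nulls, then split into features and a label
training_data = bq_df.dropna()
X = training_data[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
y = training_data[["body_mass_g"]]

# Train a gradient boosted tree regression model
xgb_model = XGBRegressor()
xgb_model.fit(X, y)

# Evaluate the model and generate predictions
score = xgb_model.score(X, y)
predictions = xgb_model.predict(X)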
Forecasting models
You can create estimators for forecasting models by using the bigframes.ml.forecasting module. To create time series forecasting models, use the ARIMAPlus class, as shown in the sketch that follows.
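The following sketch is an assumption-based illustration rather than an official sample: it reads a hypothetical table (your-project.your_dataset.daily_sales with date and sales columns) and assumes that ARIMAPlus follows the fit and predict pattern of the other bigframes.ml estimators, with predict accepting a forecast horizon. Check the API reference before relying on the exact parameters.

from bigframes.ml.forecasting import ARIMAPlus
import bigframes.pandas as bpd

# Hypothetical table with one date column and one numeric column
df = bpd.read_gbq("your-project.your_dataset.daily_sales")

X = df[["date"]]   # time points (hypothetical column name)
y = df[["sales"]]  # values to forecast (hypothetical column name)

# Train the time series forecasting model
forecast_model = ARIMAPlus()
forecast_model.fit(X, y)

# Forecast the next 30 time points
forecast = forecast_model.predict(horizon=30)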
Imported models
You can create estimators for imported models by using the bigframes.ml.imported module.
- To import Open Neural Network Exchange (ONNX) models, use the ONNXModel class, as shown in the sketch after this list.
- To import TensorFlow models, use the TensorFlowModel class.
- To import XGBoost models, use the XGBoostModel class.
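As a hedged illustration, the following sketch imports an ONNX model from a hypothetical Cloud Storage path and runs prediction on a hypothetical input table. It assumes that ONNXModel accepts a session and a model_path argument; the bucket, path, and table names are placeholders, so check the API reference for the exact constructor arguments.

from bigframes.ml.imported import ONNXModel
import bigframes.pandas as bpd

session = bpd.get_global_session()

# Hypothetical Cloud Storage path to a previously exported ONNX model
imported_model = ONNXModel(
    session=session,
    model_path="gs://your-bucket/models/model.onnx",
)

# Run prediction with the imported model on a hypothetical input table
input_df = bpd.read_gbq("your-project.your_dataset.prediction_inputs")
predictions = imported_model.predict(input_df)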
Linear models
Create estimators for linear models by using the bigframes.ml.linear_model module.
- To create linear regression models, use the LinearRegression class. Use these models for forecasting, such as forecasting the sales of an item on a given day.
- To create logistic regression models, use the LogisticRegression class. Use these models for the classification of two or more possible values, such as whether an input is low-value, medium-value, or high-value.
The following code sample shows using bigframes.ml to do the following:
- Load data from BigQuery.
- Clean and prepare training data.
- Create and apply a bigframes.ml.LinearRegression regression model.
from bigframes.ml.linear_model import LinearRegression
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Filter down the data to the Adelie Penguin species
adelie_data = bq_df[bq_df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# Drop the species column
adelie_data = adelie_data.drop(columns=["species"])

# Drop rows with nulls to get training data
training_data = adelie_data.dropna()

# Specify your feature (or input) columns and the label (or output) column
feature_columns = training_data[
    ["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "sex"]
]
label_columns = training_data[["body_mass_g"]]

test_data = adelie_data[adelie_data.body_mass_g.isnull()]

# Create the linear model
model = LinearRegression()
model.fit(feature_columns, label_columns)

# Score the model
score = model.score(feature_columns, label_columns)

# Predict using the model
result = model.predict(test_data)

Large language models
You can create estimators for LLMs by using the bigframes.ml.llm module.
To create Gemini text generator models, use the GeminiTextGenerator class. Use these models for text generation tasks.
The following code sample shows using the bigframes.ml.llm GeminiTextGenerator class to create a Gemini model for code generation:
from bigframes.ml.llm import GeminiTextGenerator
import bigframes.pandas as bpd

# Create the Gemini LLM model
session = bpd.get_global_session()
connection = f"{PROJECT_ID}.{REGION}.{CONN_NAME}"
model = GeminiTextGenerator(
    session=session, connection_name=connection, model_name="gemini-2.0-flash-001"
)

df_api = bpd.read_csv("gs://cloud-samples-data/vertex-ai/bigframe/df.csv")

# Prepare the prompts and send them to the LLM model for prediction
df_prompt_prefix = "Generate Pandas sample code for DataFrame."
df_prompt = df_prompt_prefix + df_api["API"]

# Predict using the model
df_pred = model.predict(df_prompt.to_frame(), max_output_tokens=1024)

Remote models
To use BigQuery DataFrames ML remote models (bigframes.ml.remote or bigframes.ml.llm), you must enable the required APIs, including the BigQuery Connection API and the Vertex AI API.
When you use BigQuery DataFrames ML remote models, you need the Project IAM Admin role (roles/resourcemanager.projectIamAdmin) if you use a default BigQuery connection, or the Browser role (roles/browser) if you use a pre-configured connection. You can avoid this requirement by setting the bigframes.pandas.options.bigquery.skip_bq_connection_check option to True (see the snippet after this list), in which case the connection (default or pre-configured) is used as-is without any existence or permission check. If you use the pre-configured connection and skip the connection check, verify the following:
- The connection is created in the right location.
- If you use BigQuery DataFrames ML remote models, the service account has the Vertex AI User role (roles/aiplatform.user) on the project.
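The following snippet shows how you might set the option described above. The skip_bq_connection_check option name comes from this page; the connection ID shown is a hypothetical placeholder, and the bq_connection option used to point at a pre-configured connection is an assumption to verify against the API reference. Set these options before you run other BigQuery DataFrames operations, because they apply when the session is created.

import bigframes.pandas as bpd

# Use the connection as-is, without any existence or permission check
bpd.options.bigquery.skip_bq_connection_check = True

# Optionally point BigQuery DataFrames at a pre-configured connection
# (hypothetical connection ID shown)
bpd.options.bigquery.bq_connection = "your-project.us.your-connection"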
Creating a remote model in BigQuery DataFrames creates a BigQuery connection. By default, a connection named bigframes-default-connection is used. You can use a pre-configured BigQuery connection if you prefer, in which case the connection creation is skipped. The service account for the default connection is granted the Vertex AI User role (roles/aiplatform.user) on the project.
Create pipelines
You can create ML pipelines by using the bigframes.ml.pipeline module. Pipelines let you assemble several ML steps to be cross-validated together while setting different parameters. This simplifies your code, and lets you deploy data preprocessing steps and an estimator together.
To create a pipeline of transforms with a final estimator, use the Pipeline class.
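The following is a minimal sketch, not an official sample, that chains the preprocessing transformers described earlier with a LinearRegression estimator. It assumes the sklearn-style Pipeline(steps) constructor; the column choices are illustrative.

from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.preprocessing import OneHotEncoder, StandardScaler
import bigframes.pandas as bpd

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
training_data = bq_df.dropna()

X = training_data[["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
y = training_data[["body_mass_g"]]

# Chain preprocessing and a final estimator into a single pipeline
pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("scale", StandardScaler(),
         ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]),
        ("encode", OneHotEncoder(), ["island"]),
    ])),
    ("model", LinearRegression()),
])

# Fitting the pipeline fits the transformers and the estimator together
pipeline.fit(X, y)
predictions = pipeline.predict(X)

Because the preprocessing steps and the estimator are fitted together, the whole pipeline can be scored, used for prediction, and deployed as one unit.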
Select models
To split your training and testing datasets and select the best models, use the bigframes.ml.model_selection module:

- To split the data into training and testing (evaluation) sets, use the train_test_split function, as shown in the following code sample:

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

- To create multi-fold training and testing sets to train and evaluate models, use the KFold class and the KFold.split method, as shown in the following code sample. This feature is valuable for small datasets.

  kf = KFold(n_splits=5)
  for i, (X_train, X_test, y_train, y_test) in enumerate(kf.split(X, y)):
      # Train and evaluate models with training and testing sets

- To automatically create multi-fold training and testing sets, train and evaluate the model, and get the result of each fold, use the cross_validate function, as shown in the following code sample:

  scores = cross_validate(model, X, y, cv=5)
What's next
- Learn about the BigQuery DataFrames data type system.
- Learn how to generate BigQuery DataFrames code with Gemini.
- Learn how to analyze package downloads from PyPI with BigQuery DataFrames.
- View BigQuery DataFrames source code, sample notebooks, and samples on GitHub.
- Explore the BigQuery DataFrames API reference.