Use BigQuery DataFrames

BigQuery DataFrames provides a Pythonic DataFrame and machine learning (ML) API powered by the BigQuery engine. BigQuery DataFrames is an open-source package. You can run pip install --upgrade bigframes to install the latest version.

BigQuery DataFrames provides three libraries:

  • bigframes.pandas provides a pandas API that you can use to analyze and manipulate data in BigQuery. Many workloads can be migrated from pandas to bigframes by just changing a few imports. The bigframes.pandas API is scalable to support processing terabytes of BigQuery data, and the API uses the BigQuery query engine to perform calculations.
  • bigframes.bigquery provides many BigQuery SQL functions that might not have a pandas equivalent, such as sql_scalar, which can call any scalar BigQuery function, or vector_search to find similar vectors.
  • bigframes.ml provides an API similar to the scikit-learn API for ML. The ML capabilities in BigQuery DataFrames let you preprocess data, and then train models on that data. You can also chain these actions together to create data pipelines.

Required roles

To get the permissions that you need to complete the tasks in this document, ask your administrator to grant you the following IAM roles on your project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

In addition, when using BigQuery DataFrames remote functions or BigQuery DataFrames ML remote models, you need the Project IAM Admin role (roles/resourcemanager.projectIamAdmin) if you're using a default BigQuery connection, or the Browser role (roles/browser) if you're using a pre-configured connection. You can avoid this requirement by setting the bigframes.pandas.options.bigquery.skip_bq_connection_check option to True, in which case the connection (default or pre-configured) is used as-is without any existence or permission check (see the snippet after the following list). If you're using the pre-configured connection and skipping the connection check, verify the following:

  • The connection is created in the right location.
  • If you're using BigQuery DataFrames remote functions, the service account has the Cloud Run Invoker role (roles/run.invoker) on the project.
  • If you're using BigQuery DataFrames ML remote models, the service account has the Vertex AI User role (roles/aiplatform.user) on the project.
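
For example, the following minimal snippet skips the connection check:

import bigframes.pandas as bpd

# Use the connection (default or pre-configured) as-is, without any
# existence or permission checks.
bpd.options.bigquery.skip_bq_connection_check = True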

When you're performing end user authentication in an interactive environment like a notebook, Python REPL, or the command line, BigQuery DataFrames prompts for authentication if needed. Otherwise, see how to set up application default credentials for various environments.

Configure installation options

After you install BigQuery DataFrames, you can specify the following options.

Location and project

You need to specify the location and project in which you want to use BigQuery DataFrames.

You can define the location and project in your notebook in the following way:

import bigframes.pandas as bpd

PROJECT_ID = "bigframes-dev"  # @param {type:"string"}
REGION = "US"  # @param {type:"string"}

# Set BigQuery DataFrames options
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = REGION

Data processing location

BigQuery DataFrames is designed for scale, which it achieves by keeping data and processing on the BigQuery service. However, you can bring data into the memory of your client machine by calling .to_pandas() on a DataFrame or Series object. If you choose to do this, the memory limitation of your client machine applies.
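
For example, the following sketch keeps processing in BigQuery until .to_pandas() is called; it uses the same public penguins table as the samples later in this document:

import bigframes.pandas as bpd

# Computation stays in the BigQuery service while you work with the DataFrame.
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Materialize the results in client memory as a pandas DataFrame.
# The memory limitation of your client machine applies from this point on.
local_df = bq_df.to_pandas()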

Migrate to BigQuery DataFrames version 2.0

Version 2.0 of BigQuery DataFrames makes security and performance improvements to the BigQuery DataFrames API, adds new features, and introduces breaking changes. This document describes the changes and provides migration guidance. You can apply these recommendations before installing the 2.0 version by using the latest version 1.x of BigQuery DataFrames.

BigQuery DataFrames version 2.0 has the following benefits:

  • Faster queries and fewer tables are created when you run queries that return results to the client, because allow_large_results defaults to False. This can reduce storage costs, especially if you use physical bytes billing.
  • Improved security by default in the remote functions deployed by BigQuery DataFrames.

Install BigQuery DataFrames version 2.0

To avoid breaking changes, pin to a specific version of BigQuery DataFrames in your requirements.txt file (for example, bigframes==1.42.0) or your pyproject.toml file (for example, dependencies = ["bigframes==1.42.0"]). When you're ready to try the latest version, you can run pip install --upgrade bigframes to install the latest version of BigQuery DataFrames.

Use the allow_large_results option

BigQuery has a maximum response size limit for query jobs. Starting in BigQuery DataFrames version 2.0, BigQuery DataFrames enforces this limit by default in methods that return results to the client, such as peek(), to_pandas(), and to_pandas_batches(). If your job returns large results, you can set allow_large_results to True in your BigQueryOptions object to avoid breaking changes. This option is set to False by default in BigQuery DataFrames version 2.0.

import bigframes.pandas as bpd

bpd.options.bigquery.allow_large_results = True

You can override the allow_large_results option by using the allow_large_results parameter in to_pandas() and other methods. For example:

bf_df = bpd.read_gbq(query)
# ... other operations on bf_df ...
pandas_df = bf_df.to_pandas(allow_large_results=True)

Use the @remote_function decorator

BigQuery DataFrames version 2.0 makes some changes to the default behavior of the @remote_function decorator.

Keyword arguments are enforced for ambiguous parameters

To prevent passing values to an unintended parameter, BigQuery DataFrames versions 2.0 and later enforce the use of keyword arguments for the following parameters:

  • bigquery_connection
  • reuse
  • name
  • packages
  • cloud_function_service_account
  • cloud_function_kms_key_name
  • cloud_function_docker_repository
  • max_batching_rows
  • cloud_function_timeout
  • cloud_function_max_instances
  • cloud_function_vpc_connector
  • cloud_function_memory_mib
  • cloud_function_ingress_settings

When using these parameters, supply the parameter name. For example:

@remote_function(
    name="my_remote_function",
    ...
)
def my_remote_function(parameter: int) -> str:
    return str(parameter)

Set a service account

As of version 2.0, BigQuery DataFrames no longer uses the Compute Engine service account by default for the Cloud Run functions it deploys. To limit the permissions of the function that you deploy, do the following:

  1. Create a service account with minimal permissions.
  2. Supply the service account email to the cloud_function_service_account parameter of the @remote_function decorator.

For example:

@remote_function(
    cloud_function_service_account="my-service-account@my-project.iam.gserviceaccount.com",
    ...
)
def my_remote_function(parameter: int) -> str:
    return str(parameter)

If you would like to use the Compute Engine service account, you can set the cloud_function_service_account parameter of the @remote_function decorator to "default". For example:

# This usage is discouraged. Use only if you have a specific reason to use the
# default Compute Engine service account.
@remote_function(
    cloud_function_service_account="default",
    ...
)
def my_remote_function(parameter: int) -> str:
    return str(parameter)

Set ingress settings

As of version 2.0, BigQuery DataFrames sets the ingress settings of the Cloud Run functions it deploys to "internal-only". Previously, the ingress settings were set to "all" by default. You can change the ingress settings by setting the cloud_function_ingress_settings parameter of the @remote_function decorator. For example:

@remote_function(
    cloud_function_ingress_settings="internal-and-gclb",
    ...
)
def my_remote_function(parameter: int) -> str:
    return str(parameter)

Use custom endpoints

In BigQuery DataFrames versions earlier than 2.0, if a region didn't support regional service endpoints and bigframes.pandas.options.bigquery.use_regional_endpoints = True, then BigQuery DataFrames would fall back to locational endpoints. Version 2.0 of BigQuery DataFrames removes this fallback behavior. To connect to locational endpoints in version 2.0, set the bigframes.pandas.options.bigquery.client_endpoints_override option. For example:

import bigframes.pandas as bpd

bpd.options.bigquery.client_endpoints_override = {
    "bqclient": "https://LOCATION-bigquery.googleapis.com",
    "bqconnectionclient": "LOCATION-bigqueryconnection.googleapis.com",
    "bqstoragereadclient": "LOCATION-bigquerystorage.googleapis.com",
}

Replace LOCATION with the name of the BigQuery location that you want to connect to.

Use the bigframes.ml.llm module

In BigQuery DataFrames version 2.0, the default model_name for GeminiTextGenerator has been updated to "gemini-2.0-flash-001". We recommend that you supply a model_name directly to avoid breakages if the default model changes in the future.

import bigframes.ml.llm

model = bigframes.ml.llm.GeminiTextGenerator(model_name="gemini-2.0-flash-001")

Data manipulation

The following sections describe the data manipulation capabilities of BigQuery DataFrames. You can find the described functions in the bigframes.bigquery library.

pandas API

A notable feature of BigQuery DataFrames is that the bigframes.pandas API is designed to be similar to APIs in the pandas library. This design lets you employ familiar syntax patterns for data manipulation tasks. Operations defined through the BigQuery DataFrames API are executed server-side, operating directly on data stored within BigQuery and eliminating the need to transfer datasets out of BigQuery.

To check which pandas APIs are supported by BigQuery DataFrames, see Supported pandas APIs.

Inspect and manipulate data

You can use the bigframes.pandas API to perform data inspection and calculation operations. The following code sample uses the bigframes.pandas library to inspect the body_mass_g column, calculate the mean body_mass, and calculate the mean body_mass by species:

import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Inspect one of the columns (or series) of the DataFrame:
bq_df["body_mass_g"]

# Compute the mean of this series:
average_body_mass = bq_df["body_mass_g"].mean()
print(f"average_body_mass: {average_body_mass}")

# Find the heaviest species using the groupby operation to calculate the
# mean body_mass_g:
(
    bq_df["body_mass_g"]
    .groupby(by=bq_df["species"])
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

BigQuery library

The BigQuery library provides BigQuery SQLfunctions that might not have a pandas equivalent. The following sectionspresent some examples.

Process array values

You can use the bigframes.bigquery.array_agg() function in the bigframes.bigquery library to aggregate values after a groupby operation:

import bigframes.bigquery as bbq
import bigframes.pandas as bpd

s = bpd.Series([0, 1, 2, 3, 4, 5])

# Group values by whether they are divisible by 2 and aggregate them into arrays
bbq.array_agg(s.groupby(s % 2 == 0))
# False    [1 3 5]
# True     [0 2 4]
# dtype: list<item: int64>[pyarrow]

You can also use the array_length() and array_to_string() array functions.
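
For example, a minimal sketch of array_length(), assuming a small in-memory Series of arrays (the values are illustrative, not from this document):

import bigframes.bigquery as bbq
import bigframes.pandas as bpd

# A Series in which each element is an array.
s = bpd.Series([[1, 2], [3], [4, 5, 6]])

# Count the number of elements in each array: 2, 1, and 3.
bbq.array_length(s)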

Create a struct Series object

You can use the bigframes.bigquery.struct() function in the bigframes.bigquery library to create a new struct Series object with subfields for each column in a DataFrame:

import bigframes.bigquery as bbq
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Create a new STRUCT Series with subfields for each column in the DataFrame.
lengths = bbq.struct(
    bq_df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
)

lengths.peek()
# 146    {'culmen_length_mm': 51.1, 'culmen_depth_mm': ...
# 278    {'culmen_length_mm': 48.2, 'culmen_depth_mm': ...
# 337    {'culmen_length_mm': 36.4, 'culmen_depth_mm': ...
# 154    {'culmen_length_mm': 46.5, 'culmen_depth_mm': ...
# 185    {'culmen_length_mm': 50.1, 'culmen_depth_mm': ...
# dtype: struct[pyarrow]

Convert timestamps to Unix epochs

You can use the bigframes.bigquery.unix_micros() function in the bigframes.bigquery library to convert timestamps into Unix microseconds:

import pandas as pd

import bigframes.bigquery as bbq
import bigframes.pandas as bpd

# Create a series that consists of three timestamps:
# [1970-01-01, 1970-01-02, 1970-01-03]
s = bpd.Series(pd.date_range("1970-01-01", periods=3, freq="d", tz="UTC"))

bbq.unix_micros(s)
# 0               0
# 1     86400000000
# 2    172800000000
# dtype: Int64

You can also use the unix_seconds() and unix_millis() time functions.

Use the SQL scalar function

You can use the bigframes.bigquery.sql_scalar() function in the bigframes.bigquery library to access arbitrary SQL syntax representing a single column expression:

import bigframes.bigquery as bbq
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"

# The sql_scalar function can be used to inject SQL syntax that is not supported
# or difficult to express with the bigframes.pandas APIs.
bq_df = bpd.read_gbq(query_or_table)
shortest = bbq.sql_scalar(
    "LEAST({0}, {1}, {2})",
    columns=[
        bq_df["culmen_depth_mm"],
        bq_df["culmen_length_mm"],
        bq_df["flipper_length_mm"],
    ],
)

shortest.peek()
#        0
# 149    18.9
# 33     16.3
# 296    17.2
# 287    17.0
# 307    15.0
# dtype: Float64

Custom Python functions

BigQuery DataFrames lets you turn your custom Python functions into BigQuery artifacts that you can run on BigQuery DataFrames objects at scale. This extensibility support lets you perform operations beyond what is possible with BigQuery DataFrames and SQL APIs, so you can potentially take advantage of open source libraries. The two variants of this extensibility mechanism are described in the following sections.

User-defined functions (UDFs)

With UDFs (Preview), you can turn your custom Python function into a Python UDF. For an example usage, see Create a persistent Python UDF.

Creating a UDF in BigQuery DataFrames creates a BigQuery routine as the Python UDF in the specified dataset. For a full set of supported parameters, see udf.
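
The following is a minimal sketch of defining and applying a Python UDF; the dataset name, routine name, and the function logic are placeholders, not values from this document:

import bigframes.pandas as bpd

# "my_dataset" and "add_one_gram" are hypothetical names; replace them with
# a dataset and routine name in your project.
@bpd.udf(dataset="my_dataset", name="add_one_gram")
def add_one_gram(x: float) -> float:
    return x + 1.0

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# The UDF runs as a BigQuery routine, not on the client machine.
result = bq_df["body_mass_g"].apply(add_one_gram)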

Clean up

In addition to cleaning up the cloud artifacts directly in the Google Cloud console or with other tools, you can clean up the BigQuery DataFrames UDFs that were created with an explicit name argument by using the bigframes.pandas.get_global_session().bqclient.delete_routine(routine_id) command.

Requirements

To use a BigQuery DataFrames UDF, enable the BigQuery API in your project. If you're providing the bigquery_connection parameter in your project, you must also enable the BigQuery Connection API.

Limitations

  • The code in the UDF must be self-contained, meaning that it must not contain any references to an import or variable defined outside of the function body.
  • The code in the UDF must be compatible with Python 3.11, as that is the environment in which the code is executed in the cloud.
  • Re-running the UDF definition code after trivial changes in the function code (for example, renaming a variable or inserting a new line) causes the UDF to be re-created, even if these changes are inconsequential to the behavior of the function.
  • The user code is visible to users with read access on the BigQuery routines, so you should include sensitive content only with caution.
  • A project can have up to 1,000 Cloud Run functions at a time in a BigQuery location.

The BigQuery DataFrames UDF deploys a user-defined BigQuery Python function, and the related limitations apply.

Remote functions

BigQuery DataFrames lets you turn your custom scalar functions into BigQuery remote functions. For an example usage, see Create a remote function. For a full set of supported parameters, see remote_function.
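
The following is a minimal sketch, assuming you have already created a service account with minimal permissions; the service account email and the bucketing logic are placeholders, not values from this document:

import bigframes.pandas as bpd

# The service account email below is a placeholder.
@bpd.remote_function(
    cloud_function_service_account="my-service-account@my-project.iam.gserviceaccount.com",
)
def get_mass_bucket(mass: float) -> str:
    return "heavy" if mass > 4000 else "light"

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# The function runs in the deployed Cloud Run function, invoked by BigQuery.
buckets = bq_df["body_mass_g"].apply(get_mass_bucket)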

Creating a remote function in BigQuery DataFrames creates the following:

  • A Cloud Run function.
  • A BigQuery connection. By default, a connection named bigframes-default-connection is used. You can use a pre-configured BigQuery connection if you prefer, in which case the connection creation is skipped. The service account for the default connection is granted the Cloud Run Invoker role (roles/run.invoker).
  • A BigQuery remote function that uses the Cloud Run function that's been created with the BigQuery connection.

BigQuery connections are created in the same location as the BigQuery DataFrames session, using the name you provide in the custom function definition. To view and manage connections, do the following:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. Select the project in which you created the remote function.

  3. In the left pane, click Explorer.

  4. In the Explorer pane, expand the project and then click Connections.

BigQuery remote functions are created in the dataset you specify, or they are created in an anonymous dataset, which is a type of hidden dataset. If you don't set a name for a remote function during its creation, BigQuery DataFrames applies a default name that begins with the bigframes prefix. To view and manage remote functions created in a user-specified dataset, do the following:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. Select the project in which you created the remote function.

  3. In the left pane, click Explorer.

  4. In the Explorer pane, expand the project, and then click Datasets.

  5. Click the dataset in which you created the remote function.

  6. Click the Routines tab.

To view and manage Cloud Run functions, do the following:

  1. Go to the Cloud Run page.

    Go to Cloud Run

  2. Select the project in which you created the function.

  3. Filter on the Function Deployment type in the list of available services.

  4. To identify functions created by BigQuery DataFrames, look for function names with the bigframes prefix.

Clean up

In addition to cleaning up the cloud artifacts directly in the Google Cloud console or with other tools, you can clean up the BigQuery remote functions that were created without an explicit name argument and their associated Cloud Run functions in the following ways:

  • For a BigQuery DataFrames session, use the session.close() command.
  • For the default BigQuery DataFrames session, use the bigframes.pandas.close_session() command.
  • For a past session with session_id, use the bigframes.pandas.clean_up_by_session_id(session_id) command.

You can also clean up the BigQuery remote functions that were created with an explicit name argument and their associated Cloud Run functions by using the bigframes.pandas.get_global_session().bqclient.delete_routine(routine_id) command.
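
For example, a minimal sketch in which the routine ID is a placeholder for your project, dataset, and remote function name:

import bigframes.pandas as bpd

# "my-project.my_dataset.my_remote_function" is a hypothetical routine ID.
bpd.get_global_session().bqclient.delete_routine(
    "my-project.my_dataset.my_remote_function"
)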

Requirements

To use BigQuery DataFrames remote functions, you must enable the following APIs:

Limitations

  • Remote functions take about 90 seconds to become usable when you first create them. Additional package dependencies might add to the latency.
  • Re-running the remote function definition code after trivial changes in and around the function code (for example, renaming a variable, inserting a new line, or inserting a new cell in the notebook) might cause the remote function to be re-created, even if these changes are inconsequential to the behavior of the function.
  • The user code is visible to users with read access on the Cloud Run functions, so you should include sensitive content only with caution.
  • A project can have up to 1,000 Cloud Run functions at a time in a region. For more information, see Quotas.

ML and AI

The following sections describe the ML and AI capabilities for BigQuery DataFrames. These capabilities use the bigframes.ml library.

ML locations

The bigframes.ml library supports the same locations as BigQuery ML. BigQuery ML model prediction and other ML functions are supported in all BigQuery regions. Support for model training varies by region. For more information, see BigQuery ML locations.

Preprocess data

Create transformers to prepare data for use in estimators (models) by using the bigframes.ml.preprocessing module and the bigframes.ml.compose module. BigQuery DataFrames offers the following transformations, which you can combine as shown in the sketch after this list:

  • Use the KBinsDiscretizer class in the bigframes.ml.preprocessing module to bin continuous data into intervals.

  • Use the LabelEncoder class in the bigframes.ml.preprocessing module to normalize the target labels as integer values.

  • Use the MaxAbsScaler class in the bigframes.ml.preprocessing module to scale each feature to the range [-1, 1] by its maximum absolute value.

  • Use the MinMaxScaler class in the bigframes.ml.preprocessing module to standardize features by scaling each feature to the range [0, 1].

  • Use the StandardScaler class in the bigframes.ml.preprocessing module to standardize features by removing the mean and scaling to unit variance.

  • Use the OneHotEncoder class in the bigframes.ml.preprocessing module to transform categorical values into numeric format.

  • Use the ColumnTransformer class in the bigframes.ml.compose module to apply transformers to DataFrames columns.
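
As a hedged sketch of combining these transformers (the column choices are illustrative, not taken from this document):

from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder, StandardScaler

import bigframes.pandas as bpd

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

# Scale the numeric columns and one-hot encode a categorical column.
transformer = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["culmen_length_mm", "flipper_length_mm"]),
        ("encode", OneHotEncoder(), "species"),
    ]
)
transformed = transformer.fit_transform(bq_df)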

Train models

You can create estimators to train models in BigQuery DataFrames.

Clustering models

You can create estimators for clustering models by using the bigframes.ml.cluster module.

  • Use the KMeans class to create K-means clustering models. Use these models for data segmentation, for example, identifying customer segments. K-means is an unsupervised learning technique, so model training doesn't require labels or splitting data for training or evaluation.

The following code sample shows using the bigframes.ml.cluster KMeans class to create a k-means clustering model for data segmentation:

from bigframes.ml.cluster import KMeans

import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Create the KMeans model
cluster_model = KMeans(n_clusters=10)
cluster_model.fit(bq_df["culmen_length_mm"], bq_df["sex"])

# Predict using the model
result = cluster_model.predict(bq_df)

# Score the model
score = cluster_model.score(bq_df)

Decomposition models

You can create estimators for decomposition models by using the bigframes.ml.decomposition module, as shown in the sketch after the following list.

  • Use the PCA class to create principal component analysis (PCA) models. Use these models for computing principal components and using them to perform a change of basis on the data. This provides dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible.
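
A minimal sketch of fitting a PCA model, using the same public penguins table as the other samples (the choice of columns and component count is illustrative):

from bigframes.ml.decomposition import PCA

import bigframes.pandas as bpd

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()
features = bq_df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]

# Project the three measurements onto two principal components.
pca_model = PCA(n_components=2)
pca_model.fit(features)
components = pca_model.predict(features)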

Ensemble models

You can create estimators for ensemble models by using the bigframes.ml.ensemble module, as shown in the sketch after the following list.

  • Use the RandomForestClassifier class to create random forest classifier models. Use these models for constructing multiple learning method decision trees for classification.

  • Use the RandomForestRegressor class to create random forest regression models. Use these models for constructing multiple learning method decision trees for regression.

  • Use the XGBClassifier class to create gradient boosted tree classifier models. Use these models for additively constructing multiple learning method decision trees for classification.

  • Use the XGBRegressor class to create gradient boosted tree regression models. Use these models for additively constructing multiple learning method decision trees for regression.
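
A minimal sketch of training one of these estimators, using the same public penguins table (the feature and label choices are illustrative):

from bigframes.ml.ensemble import RandomForestClassifier

import bigframes.pandas as bpd

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

# Predict the species from the numeric measurements.
features = bq_df[
    ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]
]
labels = bq_df[["species"]]

model = RandomForestClassifier()
model.fit(features, labels)
predictions = model.predict(features)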

Forecasting models

You can create estimators for forecasting models by using the bigframes.ml.forecasting module.

Imported models

You can create estimators for imported models by using the bigframes.ml.imported module.

Linear models

Create estimators for linear models by using the bigframes.ml.linear_model module.

  • Use the LinearRegression class to create linear regression models. Use these models for forecasting, for example, forecasting the sales of an item on a given day.

  • Use the LogisticRegression class to create logistic regression models. Use these models for the classification of two or more possible values, such as whether an input is low-value, medium-value, or high-value.

The following code sample shows using bigframes.ml to load data from BigQuery, clean and prepare training data, and then create, fit, score, and apply a linear regression model:

from bigframes.ml.linear_model import LinearRegression

import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Filter down to the data for the Adelie Penguin species
adelie_data = bq_df[bq_df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# Drop the species column
adelie_data = adelie_data.drop(columns=["species"])

# Drop rows with nulls to get training data
training_data = adelie_data.dropna()

# Specify your feature (or input) columns and the label (or output) column:
feature_columns = training_data[
    ["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "sex"]
]
label_columns = training_data[["body_mass_g"]]

test_data = adelie_data[adelie_data.body_mass_g.isnull()]

# Create the linear model
model = LinearRegression()
model.fit(feature_columns, label_columns)

# Score the model
score = model.score(feature_columns, label_columns)

# Predict using the model
result = model.predict(test_data)

Large language models

You can create estimators for LLMs by using the bigframes.ml.llm module.

Use the GeminiTextGenerator class to create Gemini text generator models. Use these models for text generation tasks.

The following code sample shows using the bigframes.ml.llm GeminiTextGenerator class to create a Gemini model for code generation:

from bigframes.ml.llm import GeminiTextGenerator

import bigframes.pandas as bpd

# Create the Gemini LLM model
session = bpd.get_global_session()
connection = f"{PROJECT_ID}.{REGION}.{CONN_NAME}"
model = GeminiTextGenerator(
    session=session, connection_name=connection, model_name="gemini-2.0-flash-001"
)

df_api = bpd.read_csv("gs://cloud-samples-data/vertex-ai/bigframe/df.csv")

# Prepare the prompts and send them to the LLM model for prediction
df_prompt_prefix = "Generate Pandas sample code for DataFrame."
df_prompt = df_prompt_prefix + df_api["API"]

# Predict using the model
df_pred = model.predict(df_prompt.to_frame(), max_output_tokens=1024)

Remote models

To use BigQuery DataFrames ML remote models (bigframes.ml.remote or bigframes.ml.llm), you must enable the following APIs:

Creating a remote model in BigQuery DataFrames creates a BigQuery connection. By default, a connection named bigframes-default-connection is used. You can use a pre-configured BigQuery connection if you prefer, in which case the connection creation is skipped. The service account for the default connection is granted the Vertex AI User role (roles/aiplatform.user) on the project.

Create pipelines

You can create ML pipelines by using the bigframes.ml.pipeline module. Pipelines let you assemble several ML steps to be cross-validated together while setting different parameters. This simplifies your code, and lets you deploy data preprocessing steps and an estimator together.

Use the Pipeline class to create a pipeline of transforms with a final estimator.
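
A minimal sketch of a pipeline that chains preprocessing with a linear regression estimator, using the same public penguins table (the column choices are illustrative, not from this document):

from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.preprocessing import OneHotEncoder, StandardScaler

import bigframes.pandas as bpd

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

# Preprocessing and the estimator are fitted together as one unit.
preprocessor = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["culmen_length_mm", "flipper_length_mm"]),
        ("encode", OneHotEncoder(), ["island", "sex"]),
    ]
)
pipeline = Pipeline([("preprocess", preprocessor), ("model", LinearRegression())])

features = bq_df[["culmen_length_mm", "flipper_length_mm", "island", "sex"]]
label = bq_df[["body_mass_g"]]
pipeline.fit(features, label)
predictions = pipeline.predict(features)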

Select models

Use the bigframes.ml.model_selection module to split your training and testing datasets and select the best models:

  • Use the train_test_split function to split the data into training and testing (evaluation) sets, as shown in the following code sample:

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
  • Use the KFold class and the KFold.split method to create multi-fold training and testing sets to train and evaluate models, as shown in the following code sample. This feature is valuable for small datasets.

    kf = KFold(n_splits=5)
    for i, (X_train, X_test, y_train, y_test) in enumerate(kf.split(X, y)):
        # Train and evaluate models with training and testing sets
  • Use the cross_validate function to automatically create multi-fold training and testing sets, train and evaluate the model, and get the result of each fold, as shown in the following code sample:

    scores = cross_validate(model, X, y, cv=5)

What's next
