Manipulate data with BigQuery DataFrames

This document describes the data manipulation capabilities available with BigQuery DataFrames. The functions described in this document are available in the bigframes.bigquery library.

Required roles

To get the permissions that you need to complete the tasks in this document, ask your administrator to grant you the following IAM roles on your project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

When you perform end user authentication in an interactive environment like a notebook, Python REPL, or the command line, BigQuery DataFrames prompts for authentication if needed. Otherwise, see how to set up application default credentials for various environments.

pandas API

A notable feature of BigQuery DataFrames is that the bigframes.pandas API is designed to be similar to APIs in the pandas library. This design lets you employ familiar syntax patterns for data manipulation tasks. Operations defined through the BigQuery DataFrames API are executed server-side, operating directly on data stored within BigQuery and eliminating the need to transfer datasets out of BigQuery.

To check which pandas APIs are supported by BigQuery DataFrames, see Supported pandas APIs.

Inspect and manipulate data

You can use the bigframes.pandas API to perform data inspection and calculation operations. The following code sample uses the bigframes.pandas library to inspect the body_mass_g column, calculate the mean body_mass, and calculate the mean body_mass by species:

```python
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Inspect one of the columns (or series) of the DataFrame:
bq_df["body_mass_g"]

# Compute the mean of this series:
average_body_mass = bq_df["body_mass_g"].mean()
print(f"average_body_mass: {average_body_mass}")

# Find the heaviest species using the groupby operation to calculate the
# mean body_mass_g:
(
    bq_df["body_mass_g"]
    .groupby(by=bq_df["species"])
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
```

BigQuery library

The BigQuery library provides BigQuery SQL functions that might not have a pandas equivalent. The following sections present some examples.

Process array values

You can use the bigframes.bigquery.array_agg() function in the bigframes.bigquery library to aggregate values after a groupby operation:

```python
import bigframes.bigquery as bbq
import bigframes.pandas as bpd

s = bpd.Series([0, 1, 2, 3, 4, 5])

# Group values by whether they are divisible by 2 and aggregate them into arrays
bbq.array_agg(s.groupby(s % 2 == 0))
# False    [1 3 5]
# True     [0 2 4]
# dtype: list<item: int64>[pyarrow]
```

You can also use the array_length() and array_to_string() array functions.

Create a struct Series object

You can use the bigframes.bigquery.struct() function in the bigframes.bigquery library to create a new struct Series object with subfields for each column in a DataFrame:

```python
import bigframes.bigquery as bbq
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Create a new STRUCT Series with subfields for each column in a DataFrame
lengths = bbq.struct(
    bq_df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
)
lengths.peek()
# 146    {'culmen_length_mm': 51.1, 'culmen_depth_mm': ...
# 278    {'culmen_length_mm': 48.2, 'culmen_depth_mm': ...
# 337    {'culmen_length_mm': 36.4, 'culmen_depth_mm': ...
# 154    {'culmen_length_mm': 46.5, 'culmen_depth_mm': ...
# 185    {'culmen_length_mm': 50.1, 'culmen_depth_mm': ...
# dtype: struct[pyarrow]
```

Convert timestamps to Unix epochs

You can use the bigframes.bigquery.unix_micros() function in the bigframes.bigquery library to convert timestamps into Unix microseconds:

```python
import pandas as pd

import bigframes.bigquery as bbq
import bigframes.pandas as bpd

# Create a series that consists of three timestamps:
# [1970-01-01, 1970-01-02, 1970-01-03]
s = bpd.Series(pd.date_range("1970-01-01", periods=3, freq="d", tz="UTC"))
bbq.unix_micros(s)
# 0               0
# 1     86400000000
# 2    172800000000
# dtype: Int64
```

You can also use the unix_seconds() and unix_millis() time functions.

Use the SQL scalar function

You can use the bigframes.bigquery.sql_scalar() function in the bigframes.bigquery library to access arbitrary SQL syntax representing a single-column expression:

```python
import bigframes.bigquery as bbq
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"

# The sql_scalar function can be used to inject SQL syntax that is not supported
# or difficult to express with the bigframes.pandas APIs.
bq_df = bpd.read_gbq(query_or_table)
shortest = bbq.sql_scalar(
    "LEAST({0}, {1}, {2})",
    columns=[
        bq_df["culmen_depth_mm"],
        bq_df["culmen_length_mm"],
        bq_df["flipper_length_mm"],
    ],
)
shortest.peek()
#         0
# 149    18.9
# 33     16.3
# 296    17.2
# 287    17.0
# 307    15.0
# dtype: Float64
```

What's next


Last updated 2026-02-18 UTC.