Programmatic analysis tools

This document describes multiple ways to write and run code to analyze data managed in BigQuery.

Although SQL is a powerful query language, programming languages such as Python, Java, or R provide syntaxes and a large array of built-in statistical functions that data analysts might find more expressive and easier to manipulate for certain types of data analysis.

Similarly, while spreadsheets are widely used, other programming environments like notebooks can sometimes provide a more flexible environment for doing complex data analysis and exploration.

Colab Enterprise notebooks

You can use Colab Enterprise notebooks in BigQuery to complete analysis and machine learning (ML) workflows by using SQL, Python, and other common packages and APIs. Notebooks offer improved collaboration and management with the following options:

  • Share notebooks with specific users and groups by using Identity and Access Management (IAM).
  • Review the notebook version history.
  • Revert to or branch from previous versions of the notebook.

Notebooks are BigQuery Studio code assets that are powered by Dataform, although notebooks aren't visible in Dataform. Saved queries are also code assets. All code assets are stored in a default region. Updating the default region changes the region for all code assets that are created after that point.

Notebook capabilities are available only in the Google Cloud console.

Notebooks in BigQuery offer the following benefits:

  • BigQuery DataFrames is integrated into notebooks, no setup required. BigQuery DataFrames is a Python API that you can use to analyze BigQuery data at scale by using the pandas DataFrame and scikit-learn APIs.
  • Assistive code development powered by Gemini generative AI.
  • Auto-completion of SQL statements, the same as in the BigQuery editor.
  • The ability to save, share, and manage versions of notebooks.
  • The ability to use matplotlib, seaborn, and other popular libraries to visualize data at any point in your workflow (see the sketch after this list).
  • The ability to write and execute SQL in a cell that can reference Python variables from your notebook.
  • Interactive DataFrame visualization that supports aggregation and customization.
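For example, the following sketch shows one way to query and visualize data from a notebook cell, assuming the google-cloud-bigquery client library and matplotlib are installed in the notebook environment; the public dataset is only illustrative:

    from google.cloud import bigquery
    import matplotlib.pyplot as plt

    # Run a query and pull the (small) result set into a pandas DataFrame.
    client = bigquery.Client()
    sql = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """
    df = client.query(sql).to_dataframe()

    # Visualize the results inline in the notebook.
    df.plot(kind="bar", x="name", y="total", legend=False)
    plt.ylabel("Total occurrences")
    plt.show()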

You can get started with notebooks by using notebook gallery templates. For more information, see Create a notebook using the notebook gallery.

BigQuery DataFrames

BigQuery DataFrames is a set of open source Python libraries that let you take advantage of BigQuery data processing by using familiar Python APIs. BigQuery DataFrames implements the pandas and scikit-learn APIs by pushing the processing down to BigQuery through SQL conversion. This design lets you use BigQuery to explore and process terabytes of data, and also train ML models, all with Python APIs.
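The following is a minimal sketch of that pattern, assuming the bigframes package is installed; the project ID is a placeholder and the public table is only illustrative:

    import bigframes.pandas as bpd

    # Queries run in BigQuery under this project (placeholder ID).
    bpd.options.bigquery.project = "your-project-id"

    # Read a BigQuery table into a BigQuery DataFrame; the data stays in BigQuery.
    df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

    # pandas-style operations are converted to SQL and pushed down to BigQuery.
    summary = (
        df.groupby("species")["body_mass_g"]
        .mean()
        .sort_values(ascending=False)
    )

    # Materialize only the small aggregated result locally.
    print(summary.to_pandas())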

BigQuery DataFrames offers the following benefits:

  • More than 750 pandas and scikit-learn APIs implemented through transparent SQL conversion to BigQuery and BigQuery ML APIs.
  • Deferred execution of queries for enhanced performance.
  • Extending data transformations with user-defined Python functions to let you process data in the cloud. These functions are automatically deployed as BigQuery remote functions (see the sketch after this list).
  • Integration with Vertex AI to let you use Gemini models for text generation.
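As a rough illustration of the remote function flow mentioned in this list, the following sketch decorates a plain Python function so that BigQuery DataFrames can deploy it as a BigQuery remote function. Deployment requires a BigQuery connection and suitable IAM permissions, and the decorator's exact parameters vary between bigframes versions, so treat this as an outline rather than a drop-in snippet:

    import bigframes.pandas as bpd

    df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

    # Ask bigframes to deploy this function as a BigQuery remote function;
    # a BigQuery connection and the appropriate permissions are assumed.
    @bpd.remote_function(input_types=[float], output_type=str)
    def size_bucket(body_mass_g):
        return "large" if body_mass_g >= 4500 else "small"

    # The function is applied in the cloud, not on your local machine.
    df["size"] = df["body_mass_g"].apply(size_bucket)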

Other programmatic analysis solutions

The following programmatic analysis solutions are also available in BigQuery.

Jupyter notebooks

Jupyter is an open source web-based application for publishing notebooks that contain live code, textual descriptions, and visualizations. Data scientists, machine learning specialists, and students commonly use this platform for tasks such as data cleaning and transformation, numerical simulation, statistical modeling, data visualization, and ML.

Jupyter notebooks are built on top of the IPython kernel, a powerful interactive shell that can interact directly with BigQuery by using the IPython Magics for BigQuery. Alternatively, you can access BigQuery from your Jupyter notebook instances by installing any of the available BigQuery client libraries. You can visualize BigQuery GIS data in Jupyter notebooks through the GeoJSON extension. For more details on the BigQuery integration, see the tutorial Visualizing BigQuery data in a Jupyter notebook.

Figure: Jupyter notebook chart showing a visualization of BigQuery GIS data.
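For example, after you install the google-cloud-bigquery library, you can load its IPython magic in one notebook cell:

    %load_ext google.cloud.bigquery

Then, in a separate cell, run a query with the %%bigquery cell magic (it must be the first line of the cell) and store the result in a pandas DataFrame named corpus_counts; the public table is only illustrative:

    %%bigquery corpus_counts
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 10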

JupyterLab is a web-based user interface for managing documents and activities such as Jupyter notebooks, text editors, terminals, and custom components. With JupyterLab, you can arrange multiple documents and activities side by side in the work area by using tabs and splitters.

Figure: JupyterLab work area with multiple documents and activities arranged side by side using tabs and splitters.

You can deploy Jupyter notebooks and JupyterLab environments on Google Cloud by using one of the following products:

Apache Zeppelin

Apache Zeppelin is an open source project that offers web-based notebooks for data analytics. You can deploy an instance of Apache Zeppelin on Dataproc by installing the Zeppelin optional component. By default, notebooks are saved in Cloud Storage in the Dataproc staging bucket, which is specified by the user or auto-created when the cluster is created. You can change the notebook location by adding the property zeppelin:zeppelin.notebook.gcs.dir when you create the cluster. For more information about installing and configuring Apache Zeppelin, see the Zeppelin component guide. For an example, see Analyzing BigQuery datasets using BigQuery Interpreter for Apache Zeppelin.

Figure: SQL analysis of table data, shown in Zeppelin.
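If you prefer to automate this setup from Python instead of the Google Cloud console or the gcloud CLI, the following sketch uses the Dataproc client library to create a cluster with the Zeppelin optional component and the notebook-location property described above. The project ID, region, cluster name, machine types, and bucket are placeholders, and the configuration fields shown reflect one client library version, so adjust them for your environment:

    from google.cloud import dataproc_v1

    project_id = "your-project-id"
    region = "us-central1"

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": project_id,
        "cluster_name": "zeppelin-cluster",
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
            "software_config": {
                # Install the Zeppelin optional component and keep notebooks
                # in a Cloud Storage location of your choosing.
                "optional_components": [dataproc_v1.Component.ZEPPELIN],
                "properties": {
                    "zeppelin:zeppelin.notebook.gcs.dir": "gs://your-bucket/notebooks"
                },
            },
        },
    }

    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    operation.result()  # Block until the cluster is ready.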

Apache Hadoop, Apache Spark, and Apache Hive

For part of your data analytics pipeline migration, you might want to migrate some legacy Apache Hadoop, Apache Spark, or Apache Hive jobs that need to directly process data from your data warehouse. For example, you might extract features for your machine learning workloads.

Dataproc lets you deploy fully managed Hadoop and Spark clusters in an efficient, cost-effective way. Dataproc integrates with open source BigQuery connectors. These connectors use the BigQuery Storage API, which streams data in parallel directly from BigQuery through gRPC.
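For example, a PySpark job on Dataproc can read a BigQuery table through the connector roughly as follows. The connector is preinstalled on recent Dataproc image versions (on older images you provide the connector JAR yourself), and the public table is only illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bigquery-example").getOrCreate()

    # Read a BigQuery table through the spark-bigquery connector, which uses
    # the BigQuery Storage API to stream rows in parallel.
    words = (
        spark.read.format("bigquery")
        .option("table", "bigquery-public-data.samples.shakespeare")
        .load()
    )

    # Standard Spark transformations run on the streamed data.
    word_counts = words.groupBy("word").sum("word_count")
    word_counts.show(10)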

When you migrate your existing Hadoop and Spark workloads to Dataproc, you can check that your workloads' dependencies are covered by the supported Dataproc versions. If you need to install custom software, you might consider creating your own Dataproc image, writing your own initialization actions, or specifying custom Python package requirements.

To get started, see the Dataproc quickstart guides and the BigQuery connector code samples.

Apache Beam

Apache Beam is an open source framework that provides a rich set of windowing and session analysis primitives as well as an ecosystem of source and sink connectors, including a connector for BigQuery. Apache Beam lets you transform and enrich data both in stream (real time) and batch (historical) modes with equal reliability and expressiveness.

Dataflow is a fully managed service for running Apache Beam jobs at scale. The Dataflow serverless approach removes operational overhead, with performance, scaling, availability, security, and compliance handled automatically so that you can focus on programming instead of managing server clusters.

Figure: Dataflow execution graph with an expanded composite transform (MakeMapView); the subtransform that creates the side input (CreateDataflowView) is selected, and its side input metrics are shown in the Step tab.

You can submit Dataflow jobs in different ways, either through the command-line interface, the Java SDK, or the Python SDK.
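For example, with the Python SDK a pipeline that reads from BigQuery looks roughly like the following sketch. Switching the runner option to DataflowRunner and adding your project, region, and a temp_location bucket (all placeholders here) submits the same pipeline to Dataflow:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Switch to runner="DataflowRunner" and add project, region, and
    # temp_location options to submit this pipeline to Dataflow.
    options = PipelineOptions(runner="DirectRunner")

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # ReadFromBigQuery exports via Cloud Storage, so provide a bucket
            # through the gcs_location argument or the temp_location option.
            | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(
                query="""
                    SELECT corpus, SUM(word_count) AS total_words
                    FROM `bigquery-public-data.samples.shakespeare`
                    GROUP BY corpus
                """,
                use_standard_sql=True,
            )
            | "Format" >> beam.Map(
                lambda row: f"{row['corpus']}: {row['total_words']}")
            | "Print" >> beam.Map(print)
        )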

If you want to migrate your data queries and pipelines from other frameworks to Apache Beam and Dataflow, read about the Apache Beam programming model and browse the official Dataflow documentation.

Other resources

BigQuery offers a large array of client libraries in multiple programming languages such as Java, Go, Python, JavaScript, PHP, and Ruby. Some data analysis frameworks such as pandas provide plugins that interact directly with BigQuery. For some practical examples, see the Visualize BigQuery data in a Jupyter notebook tutorial.
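For example, the pandas-gbq plugin runs a query in BigQuery and loads the result into a pandas DataFrame; the project ID below is a placeholder:

    import pandas_gbq

    # Run a query in BigQuery and load the result as a pandas DataFrame.
    df = pandas_gbq.read_gbq(
        """
        SELECT corpus, SUM(word_count) AS total_words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY total_words DESC
        """,
        project_id="your-project-id",
    )
    print(df.head())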

Lastly, if you prefer to write programs in a shell environment, you can use the bq command-line tool.
