Query open table formats with manifests

Important: The term "BigLake" on this page refers to an access delegation functionality for external tables in BigQuery. For information about BigLake, the stand-alone Google Cloud product that includes BigLake metastore, the Apache Iceberg REST catalog, and BigLake tables for Apache Iceberg, see BigLake overview.

This document describes how to use manifest files to query data stored in open table formats such as Apache Hudi and Delta Lake.

Some open table formats, such as Hudi and Delta Lake, export their current state as one or more manifest files. A manifest file contains a list of the data files that make up a table. With manifest support in BigQuery, you can query and load data stored in open table formats.

Before you begin

Required roles

To query BigLake tables based on Hudi and Delta Lake data, ensure that you have the following roles:

  • BigQuery Connection User (roles/bigquery.connectionUser)
  • BigQuery Data Viewer (roles/bigquery.dataViewer)
  • BigQuery User (roles/bigquery.user)

You can also query Hudi external tables. However, we recommend that you upgrade the external table to BigLake. To query Hudi external tables, ensure that you have the following roles:

  • BigQuery Data Viewer (roles/bigquery.dataViewer)
  • BigQuery User (roles/bigquery.user)
  • Storage Object Viewer (roles/storage.objectViewer)

Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, see Viewing the grantable roles on resources.

To see the exact permissions that are required to query BigLake tables, expand the Required permissions section:

Required permissions

You might also be able to get these permissions with custom roles or other predefined roles.

Query Hudi workloads

Important: To perform the actions described in the following sections, you must use either the Hudi-BigQuery connector version 0.14.0 or greater, or the Hudi component in Dataproc 2.1, which has the appropriate version of the connector backported. In previous versions, the connector created views on the manifest files, which wasn't optimal for query performance. If you are using a previous version of the connector, you must drop the existing view that represents the Hudi table in BigQuery to avoid a schema mismatch error.

To query Hudi data, follow these steps:

  1. Create an external table based on Hudi data.
  2. Upgrade the external table to BigLake.

Create Hudi external tables

When you sync tables by using the sync tool for Hudi and BigQuery, enable the use-bq-manifest-file flag to transition to the manifest file approach. This flag also exports a manifest file in a format supported by BigQuery and uses it to create an external table with the name specified in the --table parameter.

To create a Hudi external table, follow these steps:

  1. To create a Hudi external table, submit a job to an existing Dataproc cluster. When you build the Hudi-BigQuery connector, enable the use-bq-manifest-file flag to transition to the manifest file approach. This flag exports a manifest file in a format supported by BigQuery and uses it to create an external table with the name specified in the --table parameter.

    spark-submit \
       --master yarn \
       --packages com.google.cloud:google-cloud-bigquery:2.10.4 \
       --class org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
       JAR \
       --project-id PROJECT_ID \
       --dataset-name DATASET \
       --dataset-location LOCATION \
       --table TABLE \
       --source-uri URI \
       --source-uri-prefix URI_PREFIX \
       --base-path BASE_PATH \
       --partitioned-by PARTITION_BY \
       --use-bq-manifest-file

    Replace the following:

    • JAR: If you are using the Hudi-BigQuery connector, specify hudi-gcp-bundle-0.14.0.jar. If you are using the Hudi component in Dataproc 2.1, specify /usr/lib/hudi/tools/bq-sync-tool/hudi-gcp-bundle-0.12.3.1.jar.

    • PROJECT_ID: the project ID in which you want to create the Hudi BigLake table

    • DATASET: the dataset in which you want to create the Hudi BigLake table

    • LOCATION: the location in which you want to create the Hudi BigLake table

    • TABLE: the name of the table that you want to create

      If you are transitioning from an earlier version of the Hudi-BigQuery connector (0.13.0 and earlier) that created views on the manifest files, use the same table name, as it lets you keep the existing downstream pipeline code.

    • URI: the Cloud Storage URI that you created to store the Hudi manifest file

      This URI points to the first-level partition; make sure to include the partition key. For example, gs://mybucket/hudi/mydataset/EventDate=*

    • URI_PREFIX: the prefix for the Cloud Storage URI path; usually, this is the path to the Hudi tables

    • BASE_PATH: the base path for Hudi tables

      For example, gs://mybucket/hudi/mydataset/

    • PARTITION_BY: the partition value

      For example, EventDate

    For more information about the connector's configuration, see Hudi-BigQuery connector.

  2. To set appropriate fine-grained controls or to accelerate performance by enabling metadata caching, see Upgrade BigLake tables.

Query Delta workloads

Delta tables are now natively supported. We recommend creating Delta BigLake tables for Delta workloads. Delta Lake BigLake tables support more advanced Delta Lake tables, including tables with column remapping and deletion vectors. Additionally, Delta BigLake tables directly read the latest snapshot, so updates are instantly available.

To query Delta workloads, follow these steps:

  1. Generate a manifest file.
  2. Create a BigLake table based on the manifest file.
  3. Set appropriate fine-grained controls or accelerate performance by enabling metadata caching. To do this, see Upgrade BigLake tables.

Generate a manifest file

BigQuery supports manifest files in the SymLinkTextInputFormat format, which is a newline-delimited list of URIs. For more information about generating a manifest file, see Set up Presto to Delta Lake integration and query Delta tables.
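A SymLinkTextInputFormat manifest is just a plain text file with one data-file URI per line. As a rough local illustration (the file names below are hypothetical, not from any real table), the following Python sketch writes and reads back such a manifest:

```python
# Illustrative only: a SymLinkTextInputFormat manifest is a plain text
# file listing one data-file URI per line. These URIs are hypothetical.
data_files = [
    "gs://mybucket/delta/mytable/part-00000-aaa.snappy.parquet",
    "gs://mybucket/delta/mytable/part-00001-bbb.snappy.parquet",
]

# Write the manifest as newline-delimited URIs.
with open("manifest", "w") as f:
    f.write("\n".join(data_files) + "\n")

# Read it back: each non-empty line names one data file in the snapshot.
with open("manifest") as f:
    uris = [line.strip() for line in f if line.strip()]

print(len(uris))  # number of data files in the table snapshot
```

In practice you don't write this file by hand; the GENERATE commands below produce it for you under the table's _symlink_format_manifest directory.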

To generate a manifest file, submit a job to an existing Dataproc cluster:

SQL

Using Spark, run the following command on a Delta table at location path-to-delta-table:

GENERATE symlink_format_manifest FOR TABLE delta.`<path-to-delta-table>`

Scala

Using Spark, run the following command on a Delta table at location path-to-delta-table:

val deltaTable = DeltaTable.forPath(<path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")

Java

Using Spark, run the following command on a Delta table at location path-to-delta-table:

DeltaTable deltaTable = DeltaTable.forPath(<path-to-delta-table>);
deltaTable.generate("symlink_format_manifest");

Python

Using Spark, run the following command on a Delta table at location path-to-delta-table:

deltaTable = DeltaTable.forPath(<path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")

Create Delta BigLake tables

To create a Delta BigLake table, use the CREATE EXTERNAL TABLE statement with the file_set_spec_type field set to NEW_LINE_DELIMITED_MANIFEST:

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the CREATE EXTERNAL TABLE statement:

    CREATE EXTERNAL TABLE PROJECT_ID.DATASET_NAME.TABLE_NAME
    WITH PARTITION COLUMNS(
      `PARTITION_COLUMN PARTITION_COLUMN_TYPE`,)
    WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_NAME`
    OPTIONS (
       format = "DATA_FORMAT",
       uris = ["URI"],
       file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',
       hive_partition_uri_prefix = "PATH_TO_DELTA_TABLE",
       max_staleness = STALENESS_INTERVAL,
       metadata_cache_mode = 'CACHE_MODE');

    Replace the following:

    • PROJECT_ID: the ID of the project in which you want to create the table
    • DATASET_NAME: the name of the dataset you created
    • TABLE_NAME: the name you want to give to this table
    • REGION: the location where the connection is located (for example, us-east1)
    • CONNECTION_NAME: the name of the connection you created
    • DATA_FORMAT: any of the supported formats (such as PARQUET)
    • URI: the path to the manifest file (for example, gs://mybucket/path)
    • PATH_TO_DELTA_TABLE: a common prefix for all source URIs before the partition key encoding begins
    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Delta Lake instead.

    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

    Example:

    CREATE EXTERNAL TABLE mydataset.mytable
    WITH CONNECTION `us-east1.myconnection`
    OPTIONS (
        format = "PARQUET",
        uris = ["gs://mybucket/path/partitionpath=*"],
        file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',
        hive_partition_uri_prefix = "gs://mybucket/path/",
        max_staleness = INTERVAL 1 DAY,
        metadata_cache_mode = 'AUTOMATIC');
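The interaction between max_staleness and the metadata cache can be summed up as a simple freshness check. The following Python sketch models the semantics described above; it is an illustration only, not BigQuery's actual implementation:

```python
from datetime import datetime, timedelta

def use_cached_metadata(last_refresh: datetime,
                        max_staleness: timedelta,
                        now: datetime) -> bool:
    """Illustrative model: cached metadata is used only if it was
    refreshed within the last `max_staleness` interval; otherwise the
    operation falls back to reading metadata from the source."""
    if max_staleness <= timedelta(0):   # 0 disables metadata caching
        return False
    return now - last_refresh <= max_staleness

now = datetime(2025, 1, 1, 12, 0)
# Cache refreshed 3 hours ago, staleness limit 4 hours: cache is usable.
print(use_cached_metadata(now - timedelta(hours=3), timedelta(hours=4), now))
# Cache refreshed 5 hours ago: too stale, metadata is read from Delta Lake.
print(use_cached_metadata(now - timedelta(hours=5), timedelta(hours=4), now))
```

With CACHE_MODE set to AUTOMATIC, the "last refresh" happens on a system-defined schedule; with MANUAL, it happens when you call BQ.REFRESH_EXTERNAL_METADATA_CACHE.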

Upgrade BigLake tables

You can also accelerate the performance of your workloads by taking advantage of metadata caching and materialized views. If you want to use metadata caching, you can specify settings for it at the same time. To get table details such as source format and source URI, see Get table information.

To upgrade an external table to a BigLake table or update an existing BigLake table, select one of the following options:

SQL

Use the CREATE OR REPLACE EXTERNAL TABLE DDL statement to update a table:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE OR REPLACE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
    WITH CONNECTION {`REGION.CONNECTION_ID` | DEFAULT}
    OPTIONS(
        format = "TABLE_FORMAT",
        uris = ['BUCKET_PATH'],
        max_staleness = STALENESS_INTERVAL,
        metadata_cache_mode = 'CACHE_MODE');

    Replace the following:

    • PROJECT_ID: the name of the project that contains the table
    • DATASET: the name of the dataset that contains the table
    • EXTERNAL_TABLE_NAME: the name of the table
    • REGION: the region that contains the connection
    • CONNECTION_ID: the name of the connection to use

      To use a default connection, specify DEFAULT instead of the connection string containing REGION.CONNECTION_ID.

    • TABLE_FORMAT: the format used by the table

      You can't change this when updating the table.

    • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format ['gs://bucket_name/[folder_name/]file_name'].

      You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the path. For example, ['gs://mybucket/file_name*']. For more information, see Wildcard support for Cloud Storage URIs.

      You can specify multiple buckets for the uris option by providing multiple paths.

      The following examples show valid uris values:

      • ['gs://bucket/path1/myfile.csv']
      • ['gs://bucket/path1/*.csv']
      • ['gs://bucket/path1/*', 'gs://bucket/path2/file00*']

      When you specify uris values that target multiple files, all of those files must share a compatible schema.

      For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it

      For more information about metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.

    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually

      For more information on metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.
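The single-asterisk wildcard allowed in BUCKET_PATH behaves like a glob over object names. As a rough local illustration (the bucket contents below are hypothetical), using Python's fnmatch:

```python
from fnmatch import fnmatch

# Hypothetical objects in a bucket.
objects = [
    "gs://mybucket/file_name_1.csv",
    "gs://mybucket/file_name_2.csv",
    "gs://mybucket/other.csv",
]

# One asterisk wildcard, as in the BUCKET_PATH examples above.
pattern = "gs://mybucket/file_name*"
matched = [o for o in objects if fnmatch(o, pattern)]
print(matched)  # only the two file_name_* objects match
```

Note this is only an approximation of the matching behavior; the exact semantics are defined in Wildcard support for Cloud Storage URIs.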

bq

Use the bq mkdef and bq update commands to update a table:

  1. Generate an external table definition that describes the aspects of the table to change:

    bq mkdef --connection_id=PROJECT_ID.REGION.CONNECTION_ID \
    --source_format=TABLE_FORMAT \
    --metadata_cache_mode=CACHE_MODE \
    "BUCKET_PATH" > /tmp/DEFINITION_FILE

    Replace the following:

    • PROJECT_ID: the name of the project that contains the connection
    • REGION: the region that contains the connection
    • CONNECTION_ID: the name of the connection to use
    • TABLE_FORMAT: the format used by the table. You can't change this when updating the table.
    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information on metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

    • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format gs://bucket_name/[folder_name/]file_name.

      You can limit the files selected from the bucket by specifying one asterisk (*) wildcard character in the path. For example, gs://mybucket/file_name*. For more information, see Wildcard support for Cloud Storage URIs.

      You can specify multiple buckets for the uris option by providing multiple paths.

      The following examples show valid uris values:

      • gs://bucket/path1/myfile.csv
      • gs://bucket/path1/*.csv
      • gs://bucket/path1/*, gs://bucket/path2/file00*

      When you specify uris values that target multiple files, all of those files must share a compatible schema.

      For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

    • DEFINITION_FILE: the name of the table definition file that you are creating.

  2. Update the table using the new external table definition:

    bq update --max_staleness=STALENESS_INTERVAL \
    --external_table_definition=/tmp/DEFINITION_FILE \
    PROJECT_ID:DATASET.EXTERNAL_TABLE_NAME

    Replace the following:

    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval value between 30 minutes and 7 days, using the Y-M D H:M:S format described in the INTERVAL data type documentation. For example, specify 0-0 0 4:0:0 for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.

    • DEFINITION_FILE: the name of the table definition file that you created or updated.

    • PROJECT_ID: the name of the project that contains the table

    • DATASET: the name of the dataset that contains the table

    • EXTERNAL_TABLE_NAME: the name of the table
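Unlike the DDL statement, bq update takes the staleness interval in the Y-M D H:M:S form (years-months days hours:minutes:seconds), so INTERVAL 4 HOUR becomes 0-0 0 4:0:0. A small hypothetical helper (not part of the bq CLI) that converts an hour count into that form:

```python
def staleness_interval(hours: int) -> str:
    """Format an hour count as the Y-M D H:M:S interval string that
    `bq update --max_staleness` expects. Hours roll over into days at 24.
    Hypothetical helper for illustration, not part of the bq CLI."""
    days, h = divmod(hours, 24)
    return f"0-0 {days} {h}:0:0"

print(staleness_interval(4))   # 0-0 0 4:0:0
print(staleness_interval(30))  # 0-0 1 6:0:0
```

You would then pass the result as --max_staleness="0-0 0 4:0:0" on the bq update command shown above.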

Query BigLake and external tables

After creating a BigLake table, you can query it using GoogleSQL syntax, the same as if it were a standard BigQuery table. For example, SELECT field1, field2 FROM mydataset.my_cloud_storage_table;.

Limitations

  • BigQuery only supports querying Delta Lake reader v1 tables.

  • Hudi and BigQuery integration only works for hive-style partitioned copy-on-write tables.

What's next

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.