Query open table formats with manifests
Important: The term "BigLake" on this page refers to an accessdelegation functionality for external tables in BigQuery. Forinformation about BigLake, the stand-alone Google Cloudproduct that includes BigLake metastore, the Apache Iceberg REST catalog,and BigLake tables for Apache Iceberg seeBigLake overview.This document describes how to use manifest files to query data stored in opentable formats such asApache Hudi andDelta Lake.
Some open table formats such as Hudi and Delta Lake export their current state as one or more manifest files. A manifest file contains a list of the data files that make up a table. With the manifest support in BigQuery, you can query and load data stored in open table formats.
Before you begin
Enable the BigQuery Connection, BigQuery Reservation, and BigLake APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

To create BigLake tables, you can run the Spark commands by using one of the following methods:
- Create a Dataproc cluster. For querying Hudi tables, set the --optional-components field to HUDI. For querying Delta tables, set --optional-components to Presto.
- Use a stored procedure for Spark in BigQuery.
To store the manifest file in Cloud Storage, create a Cloud Storage bucket. You need to connect to your Cloud Storage bucket to access the manifest file. To do so, create a BigQuery connection to Cloud Storage and grant the connection's service account access to the bucket. Example gcloud commands for these setup steps follow.
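For example, you might complete this setup with the gcloud CLI. The commands below are a sketch for a Hudi-oriented setup; the API service names, cluster name, bucket name, image version, and locations are assumptions, so adjust them for your project:

# Enable the APIs listed above (service names assumed).
gcloud services enable \
    bigqueryconnection.googleapis.com \
    bigqueryreservation.googleapis.com \
    biglake.googleapis.com

# Optional: create a Dataproc 2.1 cluster with the Hudi optional component.
gcloud dataproc clusters create my-hudi-cluster \
    --region=us-central1 \
    --image-version=2.1-debian11 \
    --optional-components=HUDI

# Create a Cloud Storage bucket to hold the manifest files.
gcloud storage buckets create gs://my-manifest-bucket --location=US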
Required roles
To query BigLake tables based on Hudi and Delta Lake data, ensure you have the following roles:
- BigQuery Connection User (roles/bigquery.connectionUser)
- BigQuery Data Viewer (roles/bigquery.dataViewer)
- BigQuery User (roles/bigquery.user)
You can also query Hudi external tables. However, we recommend that you upgrade the external table to BigLake. To query Hudi external tables, ensure you have the following roles:
- BigQuery Data Viewer (roles/bigquery.dataViewer)
- BigQuery User (roles/bigquery.user)
- Storage Object Viewer (roles/storage.objectViewer)
Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, see Viewing the grantable roles on resources.
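For example, a project administrator could grant one of these roles at the project level with the gcloud CLI. This is a sketch; the project ID and user email are placeholders:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/bigquery.dataViewer"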
To see the exact permissions that are required to query BigLake tables, expand the Required permissions section:
Required permissions
- bigquery.connections.use
- bigquery.jobs.create
- bigquery.readsessions.create (only required if you are reading data with the BigQuery Storage Read API)
- bigquery.tables.get
- bigquery.tables.getData
You might also be able to get these permissions with custom roles or other predefined roles.
Query Hudi workloads
Important: To perform the actions described in the following sections, you must use either the Hudi-BigQuery connector version 0.14.0 or greater, or the Hudi component in Dataproc 2.1, which has the appropriate version of the connector backported. In previous versions, the connector created views on the manifest files, which wasn't optimal for query performance. If you are using a previous version of the connector, you must drop the existing view that represents the Hudi table in BigQuery to avoid a schema mismatch error. (An example DROP VIEW statement follows the steps below.)

To query Hudi data, follow these steps:
- Create an external table based on Hudi data.
- Upgrade the external table to BigLake.
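If you are migrating from an earlier connector version that created a view, drop that view before running the sync tool. The following statement is a minimal sketch; the dataset and view names are hypothetical and should match your existing view:

-- Drop the view created by an older Hudi-BigQuery connector (names hypothetical).
DROP VIEW IF EXISTS mydataset.mytable;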
Create Hudi external tables
When you sync tables by using the sync tool for Hudi and BigQuery, enable the use-bq-manifest-file flag to transition to the manifest file approach. This flag also exports a manifest file in a format supported by BigQuery and uses it to create an external table with the name specified in the --table parameter.
To create a Hudi external table, follow these steps:
To create a Hudi external table, submit a job to an existing Dataproc cluster. When you build the Hudi-BigQuery connector, enable the use-bq-manifest-file flag to transition to the manifest file approach. This flag exports a manifest file in a format supported by BigQuery and uses it to create an external table with the name specified in the --table parameter. (A filled-in example appears after the parameter list below.)

spark-submit \
   --master yarn \
   --packages com.google.cloud:google-cloud-bigquery:2.10.4 \
   --class org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
   JAR \
   --project-id PROJECT_ID \
   --dataset-name DATASET \
   --dataset-location LOCATION \
   --table TABLE \
   --source-uri URI \
   --source-uri-prefix URI_PREFIX \
   --base-path BASE_PATH \
   --partitioned-by PARTITION_BY \
   --use-bq-manifest-file
Replace the following:
- JAR: If you are using the Hudi-BigQuery connector, specify hudi-gcp-bundle-0.14.0.jar. If you are using the Hudi component in Dataproc 2.1, specify /usr/lib/hudi/tools/bq-sync-tool/hudi-gcp-bundle-0.12.3.1.jar
- PROJECT_ID: the project ID in which you want to create the Hudi BigLake table
- DATASET: the dataset in which you want to create the Hudi BigLake table
- LOCATION: the location in which you want to create the Hudi BigLake table
- TABLE: the name of the table that you want to create. If you are transitioning from an earlier version of the Hudi-BigQuery connector (0.13.0 and earlier) that created views on the manifest files, use the same table name, as this lets you keep the existing downstream pipeline code.
- URI: the Cloud Storage URI that you created to store the Hudi manifest file. This URI points to the first-level partition; make sure to include the partition key. For example, gs://mybucket/hudi/mydataset/EventDate=*
- URI_PREFIX: the prefix for the Cloud Storage URI path, usually the path to the Hudi tables
- BASE_PATH: the base path for Hudi tables. For example, gs://mybucket/hudi/mydataset/
- PARTITION_BY: the partition value. For example, EventDate
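Putting the placeholders together, a filled-in invocation might look like the following. This is a sketch only; the project, dataset, bucket, and table names are hypothetical, and the JAR path assumes the Dataproc 2.1 Hudi component:

spark-submit \
   --master yarn \
   --packages com.google.cloud:google-cloud-bigquery:2.10.4 \
   --class org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
   /usr/lib/hudi/tools/bq-sync-tool/hudi-gcp-bundle-0.12.3.1.jar \
   --project-id my-project \
   --dataset-name mydataset \
   --dataset-location us-central1 \
   --table mytable \
   --source-uri gs://mybucket/hudi/mydataset/EventDate=* \
   --source-uri-prefix gs://mybucket/hudi/mydataset/ \
   --base-path gs://mybucket/hudi/mydataset/ \
   --partitioned-by EventDate \
   --use-bq-manifest-file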
For more information about the connector's configuration, see Hudi-BigQuery connector.

To set appropriate fine-grained controls or to accelerate performance by enabling metadata caching, see Upgrade BigLake tables.
Query Delta workloads
Delta tables are now natively supported. We recommend creating Delta BigLake tables for Delta workloads. Delta Lake BigLake tables support more advanced Delta Lake tables, including tables with column remapping and deletion vectors. Additionally, Delta BigLake tables directly read the latest snapshot, so updates are instantly available.
To query Delta workloads, follow these steps:
- Generate a manifest file.
- Create a BigLake table based on the manifest file.
- Set appropriate fine-grained controls or accelerate performance by enabling metadata caching. To do this, see Upgrade BigLake tables.
Generate a manifest file
BigQuery supports the manifest file in a SymLinkTextInputFormat format, which is a newline-delimited list of URIs. For more information about generating a manifest file, see Set up Presto to Delta Lake integration and query Delta tables.
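For reference, a generated manifest is a plain text file, typically written under the table's _symlink_format_manifest/ directory (with one subdirectory per partition for partitioned tables), that lists the data files in the current snapshot. The following contents are a hypothetical illustration:

gs://mybucket/path/partitionpath=2024-01-01/part-00000-3f1a2b4c.c000.snappy.parquet
gs://mybucket/path/partitionpath=2024-01-01/part-00001-7d8e9f0a.c000.snappy.parquet
gs://mybucket/path/partitionpath=2024-01-02/part-00000-1b2c3d4e.c000.snappy.parquet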
To generate a manifest file, submit a job to an existing Dataproc cluster:
SQL
Using Spark, run the following command on a Delta table at location path-to-delta-table:
GENERATE symlink_format_manifest FOR TABLE delta.`<path-to-delta-table>`
Scala
Using Spark, run the following command on a Delta table at location path-to-delta-table:
val deltaTable = DeltaTable.forPath(<path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")

Java
Using Spark, run the following command on a Delta table at location path-to-delta-table:
DeltaTable deltaTable = DeltaTable.forPath(<path-to-delta-table>);
deltaTable.generate("symlink_format_manifest");

Python
Using Spark, run the following command on a Delta table at location path-to-delta-table:
deltaTable = DeltaTable.forPath(<path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")

Create Delta BigLake tables
To create a Delta BigLake table, use the CREATE EXTERNAL TABLE statement with the file_set_spec_type field set to NEW_LINE_DELIMITED_MANIFEST:
Go to the BigQuery page.
In the query editor, run the CREATE EXTERNAL TABLE statement:

CREATE EXTERNAL TABLE PROJECT_ID.DATASET_NAME.TABLE_NAME
  WITH PARTITION COLUMNS(
    `PARTITION_COLUMN PARTITION_COLUMN_TYPE`,
  )
  WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_NAME`
  OPTIONS (
    format = "DATA_FORMAT",
    uris = ["URI"],
    file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',
    hive_partition_uri_prefix = "PATH_TO_DELTA_TABLE",
    max_staleness = STALENESS_INTERVAL,
    metadata_cache_mode = 'CACHE_MODE');
Replace the following:
- DATASET_NAME: the name of the dataset you created
- TABLE_NAME: the name you want to give to this table
- REGION: the location where the connection is located (for example, us-east1)
- CONNECTION_NAME: the name of the connection you created
- DATA_FORMAT: any of the supported formats (such as PARQUET)
- URI: the path to the manifest file (for example, gs://mybucket/path)
- PATH_TO_DELTA_TABLE: a common prefix for all source URIs before the partition key encoding begins
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.
  To disable metadata caching, specify 0. This is the default.
  To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Delta Lake instead.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.
  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache (an example call follows the CREATE statement below).
  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
Example:
CREATE EXTERNAL TABLE mydataset.mytable
  WITH CONNECTION `us-east1.myconnection`
  OPTIONS (
    format = "PARQUET",
    uris = ["gs://mybucket/path/partitionpath=*"],
    file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',
    hive_partition_uri_prefix = "gs://mybucket/path/",
    max_staleness = INTERVAL 1 DAY,
    metadata_cache_mode = 'AUTOMATIC');
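If you set metadata_cache_mode to 'MANUAL' instead of 'AUTOMATIC', you refresh the cache yourself by calling the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure mentioned above. The following call is a sketch; the project, dataset, and table names are hypothetical:

-- Manually refresh the cached metadata for the hypothetical example table.
CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('myproject.mydataset.mytable');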
Upgrade BigLake tables
You can also accelerate the performance of your workloads by taking advantage of metadata caching and materialized views. If you want to use metadata caching, you can specify settings for this at the same time. To get table details such as source format and source URI, see Get table information.
To upgrade an external table to a BigLake table or to update an existing BigLake table, select one of the following options:
SQL
Use the CREATE OR REPLACE EXTERNAL TABLE DDL statement to update a table:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
CREATE OR REPLACE EXTERNAL TABLE
  `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
  WITH CONNECTION {`REGION.CONNECTION_ID` | DEFAULT}
  OPTIONS(
    format = "TABLE_FORMAT",
    uris = ['BUCKET_PATH'],
    max_staleness = STALENESS_INTERVAL,
    metadata_cache_mode = 'CACHE_MODE');
Replace the following:
- PROJECT_ID: the name of the project that contains the table
- DATASET: the name of the dataset that contains the table
- EXTERNAL_TABLE_NAME: the name of the table
- REGION: the region that contains the connection
- CONNECTION_ID: the name of the connection to use
  To use a default connection, specify DEFAULT instead of the connection string containing REGION.CONNECTION_ID.
- TABLE_FORMAT: the format used by the table
  You can't change this when updating the table.
- BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format ['gs://bucket_name/[folder_name/]file_name'].
  You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the path. For example, ['gs://mybucket/file_name*']. For more information, see Wildcard support for Cloud Storage URIs.
  You can specify multiple buckets for the uris option by providing multiple paths.
  The following examples show valid uris values:
  - ['gs://bucket/path1/myfile.csv']
  - ['gs://bucket/path1/*.csv']
  - ['gs://bucket/path1/*', 'gs://bucket/path2/file00*']
  When you specify uris values that target multiple files, all of those files must share a compatible schema.
  For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it
  For more information about metadata caching considerations, see Metadata caching for performance.
  To disable metadata caching, specify 0. This is the default.
  To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually
  For more information on metadata caching considerations, see Metadata caching for performance.
  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.
  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
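For reference, a filled-in version of this statement might look like the following. This is a sketch; the project, dataset, connection, and bucket names are hypothetical, and for the manifest-based tables described on this page you would also keep the file_set_spec_type and hive_partition_uri_prefix options from the create statement:

CREATE OR REPLACE EXTERNAL TABLE
  `my-project.mydataset.mytable`
  WITH CONNECTION `us-east1.myconnection`
  OPTIONS(
    format = "PARQUET",
    uris = ['gs://mybucket/path/partitionpath=*'],
    max_staleness = INTERVAL 4 HOUR,
    metadata_cache_mode = 'AUTOMATIC');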
Click Run.

For more information about how to run queries, see Run an interactive query.
bq
Use the bq mkdef and bq update commands to update a table:
Generate an external table definition that describes the aspects of the table to change:
bq mkdef --connection_id=PROJECT_ID.REGION.CONNECTION_ID \
    --source_format=TABLE_FORMAT \
    --metadata_cache_mode=CACHE_MODE \
    "BUCKET_PATH" > /tmp/DEFINITION_FILE
Replace the following:
- PROJECT_ID: the name of the project that contains the connection
- REGION: the region that contains the connection
- CONNECTION_ID: the name of the connection to use
- TABLE_FORMAT: the format used by the table. You can't change this when updating the table.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information on metadata caching considerations, see Metadata caching for performance.
  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.
  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
- BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format gs://bucket_name/[folder_name/]file_name.
  You can limit the files selected from the bucket by specifying one asterisk (*) wildcard character in the path. For example, gs://mybucket/file_name*. For more information, see Wildcard support for Cloud Storage URIs.
  You can specify multiple buckets for the uris option by providing multiple paths.
  The following examples show valid uris values:
  - gs://bucket/path1/myfile.csv
  - gs://bucket/path1/*.csv
  - gs://bucket/path1/*,gs://bucket/path2/file00*
  When you specify uris values that target multiple files, all of those files must share a compatible schema.
  For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.
- DEFINITION_FILE: the name of the table definition file that you are creating.
Update the table using the new external table definition:
bq update --max_staleness=STALENESS_INTERVAL \
    --external_table_definition=/tmp/DEFINITION_FILE \
    PROJECT_ID:DATASET.EXTERNAL_TABLE_NAME
Replace the following:
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.
  To disable metadata caching, specify 0. This is the default.
  To enable metadata caching, specify an interval value between 30 minutes and 7 days, using the Y-M D H:M:S format described in the INTERVAL data type documentation. For example, specify 0-0 0 4:0:0 for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.
- DEFINITION_FILE: the name of the table definition file that you created or updated.
- PROJECT_ID: the name of the project that contains the table
- DATASET: the name of the dataset that contains the table
- EXTERNAL_TABLE_NAME: the name of the table

A filled-in example of both commands follows.
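The following pair of commands is a sketch only; the connection, bucket, project, dataset, and table names are hypothetical:

bq mkdef --connection_id=my-project.us-east1.myconnection \
    --source_format=PARQUET \
    --metadata_cache_mode=AUTOMATIC \
    "gs://mybucket/path/partitionpath=*" > /tmp/delta_def.json

bq update --max_staleness="0-0 0 4:0:0" \
    --external_table_definition=/tmp/delta_def.json \
    my-project:mydataset.mytable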
Query BigLake and external tables
After creating a BigLake table, you can query it using GoogleSQL syntax, the same as if it were a standard BigQuery table. For example: SELECT field1, field2 FROM mydataset.my_cloud_storage_table;
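If the table is hive-partitioned, as in the earlier examples, filtering on the partition column can reduce the amount of data read. The following query is a sketch that assumes a hypothetical table with a partitionpath partition column, matching the example above:

-- Query only one partition of the hypothetical example table.
SELECT COUNT(*)
FROM mydataset.mytable
WHERE partitionpath = '2024-01-01';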
Limitations
- BigQuery only supports querying Delta Lake reader v1 tables.
- Hudi and BigQuery integration only works for hive-style partitioned copy-on-write tables.
What's next
- Learn about using SQL in BigQuery.
- Learn about BigLake tables.
- Learn about BigQuery quotas.