Metadata caching for external tables

Important: The term "BigLake" on this page refers to an access delegation functionality for external tables in BigQuery. For information about BigLake, the stand-alone Google Cloud product that includes BigLake metastore, the Apache Iceberg REST catalog, and BigLake tables for Apache Iceberg, see BigLake overview.

This document describes how to use metadata caching (also known as column metadata indexing) to improve query performance on object tables and some types of BigLake tables.

Object tables and some types of BigLake tables can cache metadata information about files in external datastores—for example, Cloud Storage. The following types of BigLake tables support metadata caching:

  • Amazon S3 BigLake tables
  • Cloud Storage BigLake tables

The metadata includes file names, partitioning information, and metadata for files such as row counts. You can choose whether to enable metadata caching on a table. Queries with a large number of files and with Hive partition filters benefit the most from metadata caching.

If you don't enable metadata caching, queries on the table must read the external data source to get object metadata. Reading this data increases query latency; listing millions of files from the external data source can take several minutes. If you enable metadata caching, queries can avoid listing files from the external data source, and can partition and prune files more quickly.

You can enable metadata caching on a BigLake or object table when you create the table. For more information about creating object tables, see Create object tables. For more information about creating BigLake tables, see Create Amazon S3 BigLake external tables or Create BigLake external tables for Cloud Storage.

Metadata caching settings

There are two properties that control the behavior of this feature:

  • Maximum staleness specifies when queries use cached metadata.
  • Metadata cache mode specifies how the metadata is collected.

When you have metadata caching enabled, you specify the maximum interval of metadata staleness that is acceptable for operations against the table. For example, if you specify an interval of 1 hour, then operations against the table use cached metadata if it has been refreshed within the past hour. If the cached metadata is older than that, the operation falls back to retrieving metadata from the datastore (Amazon S3 or Cloud Storage) instead. You can specify a staleness interval between 30 minutes and 7 days.
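As a sketch of how these two settings are set together, the following statement creates a Cloud Storage BigLake table with a 4-hour staleness window and automatic refresh. The project, dataset, connection, and bucket names are placeholders:

```sql
-- Placeholder project, dataset, connection, and bucket names.
-- max_staleness sets the acceptable metadata staleness window;
-- metadata_cache_mode chooses AUTOMATIC or MANUAL refresh.
CREATE EXTERNAL TABLE `myproject.mydataset.my_biglake_table`
WITH CONNECTION `myproject.us.myconnection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://mybucket/sales/*.parquet'],
  max_staleness = INTERVAL 4 HOUR,
  metadata_cache_mode = 'AUTOMATIC'
);
```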

You can choose to refresh the cache either automatically or manually:

  • For automatic refreshes, the cache is refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes. Refreshing the cache automatically is a good approach if the files in the datastore are added, deleted, or modified at random intervals. If you need to control the timing of the refresh, for example to trigger the refresh at the end of an extract-transform-load job, use manual refresh.
  • For manual refreshes, you run the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the metadata cache on whatever schedule you determine. For BigLake tables, you can refresh the metadata selectively by providing subdirectories of the table data directory. This approach lets you avoid unnecessary metadata processing. Refreshing the cache manually is a good approach if the files in the datastore are added, deleted, or modified at known intervals—for example, as the output of a pipeline.
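A manual refresh is a procedure call. The table and subdirectory names below are placeholders:

```sql
-- Refresh the metadata cache for the whole table.
CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('myproject.mydataset.my_biglake_table');

-- For a BigLake table, refresh only specific subdirectories of the
-- table's data directory to avoid unnecessary metadata processing.
CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE(
  'myproject.mydataset.my_biglake_table',
  ['gs://mybucket/sales/2024/']);
```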

Both manual and automatic cache refreshes are executed with INTERACTIVE query priority.

If you choose to use automatic refreshes, we recommend that you create a reservation, and then create an assignment with a BACKGROUND job type for the project that runs the metadata cache refresh jobs. This prevents the refresh jobs from competing with user queries for resources, and from potentially failing if there aren't sufficient resources available for them.
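One way to set this up is with the bq command-line tool. This is a sketch; the reservation name, slot count, location, and project ID are placeholders:

```shell
# Create a reservation with dedicated slots (placeholder name and size).
bq mk --reservation --location=us --slots=100 metadata_refresh_reservation

# Route the project's BACKGROUND jobs (which include metadata cache
# refreshes) to that reservation.
bq mk --reservation_assignment \
    --reservation_id=myproject:us.metadata_refresh_reservation \
    --job_type=BACKGROUND \
    --assignee_type=PROJECT \
    --assignee_id=myproject
```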

You should consider how the staleness interval and metadata caching mode values will interact before you set them. Consider the following examples:

  • If a table's metadata cache is set to require manual refreshes, and the staleness interval is set to 2 days, you must run the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure at least once every 2 days if you want operations against the table to use cached metadata.
  • If a table's metadata cache is set to refresh automatically, and the staleness interval is set to 30 minutes, some of the operations against the table might read from the datastore if the metadata cache refresh takes on the longer side of the usual 30- to 60-minute window.

For more information on setting metadata caching options for BigLake tables, see Create Amazon S3 BigLake external tables or Create BigLake external tables for Cloud Storage.

For more information on setting metadata caching options for object tables, see Create object tables.

Get information on metadata cache refresh jobs

To find information about metadata cache refresh jobs, query the INFORMATION_SCHEMA.JOBS view, as shown in the following example:

SELECT *
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE job_id LIKE '%metadata_cache_refresh%'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
ORDER BY start_time DESC
LIMIT 10;

Use customer-managed encryption keys with cached metadata

Cached metadata is protected by the customer-managed encryption key (CMEK) used for the table that the cached metadata is associated with. This might be a CMEK applied directly to the table, or a CMEK that the table inherits from the dataset or project.

If a default CMEK is set for the project or dataset, or if the existing CMEK for the project or dataset is changed, this doesn't affect existing tables or their cached metadata. You must change the key for the table to apply the new key to both the table and its cached metadata.
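One way to change the table's key is with the bq command-line tool. The table name and Cloud KMS key resource name below are placeholders:

```shell
# Apply a new CMEK to the table; the cached metadata is then protected
# by the same key. Table and key resource names are placeholders.
bq update \
    --destination_kms_key \
    projects/myproject/locations/us/keyRings/my_ring/cryptoKeys/my_key \
    mydataset.my_biglake_table
```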

CMEKs created in BigQuery don't apply to the Cloud Storage files that are used by BigLake and object tables. To obtain end-to-end CMEK encryption, configure CMEKs in Cloud Storage for those files.

Get information on metadata cache usage by query jobs

To get information about metadata cache usage for a query job, call the jobs.get method for that job and look at the MetadataCacheStatistics field in the JobStatistics2 section of the Job resource. This field provides information on which metadata cache-enabled tables were used by the query, whether the metadata cache was used by the query, and if not, the reason why not.
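For example, with the bq command-line tool you can retrieve the full job resource as JSON and inspect its statistics. The job ID below is a placeholder:

```shell
# Print the full job resource as JSON; metadata cache usage appears
# under the query statistics section. The job ID is a placeholder.
bq show --format=prettyjson -j myproject:US.bquxjob_1234abcd
```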

Table statistics

For BigLake tables that are based on Parquet files, table statistics are collected when the metadata cache is refreshed. Table statistics collection happens during both automatic and manual refreshes, and the statistics are kept for the same period as the metadata cache.

The table statistics collected include file information like row counts, physical and uncompressed file sizes, and cardinality of columns. When you run a query on a Parquet-based BigLake table, these statistics are supplied to the query optimizer to enable better query planning and potentially improve query performance for some types of queries. For example, a common query optimization is dynamic constraint propagation, where the query optimizer dynamically infers predicates on the larger fact tables in a join from the smaller dimension tables. While this optimization can speed up queries by using normalized table schemas, it requires accurate table statistics. The table statistics collected by metadata caching enable greater optimization of query plans in both BigQuery and Apache Spark.

Limitations

The following limitations apply to the metadata cache:

  • If you issue multiple concurrent manual refreshes, only one will succeed.
  • The metadata cache expires after 7 days if it isn't refreshed.
  • If you update the source URI for a table, the metadata cache is not automatically refreshed, and subsequent queries return data from the outdated cache. To avoid this, refresh the metadata cache manually. If the table's metadata cache is set to refresh automatically, you must change the table's refresh mode to manual, perform the manual refresh, and then set the table's refresh mode back to automatic.
  • If you are manually refreshing the metadata cache, and your target dataset and Cloud Storage bucket are in a regional location, you must explicitly specify this location when you run the BQ.REFRESH_EXTERNAL_METADATA_CACHE procedure call. You can do this in one of the following ways:

    Console

    1. Go to the BigQuery page in the Google Cloud console.

    2. Select a tab in the Editor.

    3. Click More, and then click Query settings.

    4. In the Location section, clear the Automatic location selection checkbox, and then specify the target region.

    5. Click Save.

    6. Run the query containing the BQ.REFRESH_EXTERNAL_METADATA_CACHE procedure call in that Editor tab.

    bq

    If you run the query containing the BQ.REFRESH_EXTERNAL_METADATA_CACHE procedure call by using bq query, be sure to specify the --location flag.
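    For instance (the table name and region are placeholders):

    ```shell
    # Run the refresh with an explicit regional location.
    bq query --nouse_legacy_sql --location=us-central1 \
      "CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('myproject.mydataset.my_biglake_table');"
    ```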

What's next


Last updated 2025-12-15 UTC.