Introduction to BigLake external tables

Important: The term "BigLake" on this page refers to an access delegation functionality for external tables in BigQuery. For information about BigLake, the stand-alone Google Cloud product that includes BigLake metastore, the Apache Iceberg REST catalog, and BigLake tables for Apache Iceberg, see BigLake overview.

This document provides an overview of BigLake and assumes familiarity with database tables and Identity and Access Management (IAM). To query data stored in the supported data stores, you must first create BigLake tables and then query them using GoogleSQL syntax.

You can also upgrade an external table to BigLake. For more information, see Upgrade an external table to BigLake.

BigLake tables let you query structured data in external data stores with access delegation. Access delegation decouples access to the BigLake table from access to the underlying data store. An external connection associated with a service account is used to connect to the data store. Because the service account handles retrieving data from the data store, you only have to grant users access to the BigLake table. This lets you enforce fine-grained security at the table level, including row-level and column-level security. For BigLake tables based on Cloud Storage, you can also use dynamic data masking. To learn more about multi-cloud analytic solutions using BigLake tables with Amazon S3 or Blob Storage data, see BigQuery Omni.
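
To make access delegation concrete, the following sketch shows how a Cloud Storage BigLake table might be created and queried. The project, dataset, connection, and bucket names are placeholders, not values from this page.

-- Hypothetical names throughout; substitute your own project, dataset,
-- connection, and bucket. The connection's service account, not the
-- querying user, reads the underlying bucket.
CREATE EXTERNAL TABLE mydataset.sales_biglake
WITH CONNECTION `myproject.us.my_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/sales/*.parquet']
);

-- Analysts only need access to the table itself to run queries like this one.
SELECT region, SUM(amount) AS total_sales
FROM mydataset.sales_biglake
GROUP BY region;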

Supported data stores

You can use BigLake tables with the following data stores:

Temporary table support

BigLake tables based on Cloud Storage can be temporary or permanent. BigLake tables based on Amazon S3 or Blob Storage must be permanent.

Multiple source files

You can create a BigLake table based on multiple external datasources, provided those data sources have the same schema.

Cross-cloud joins

Cross-cloud joins let you run queries that span both Google Cloud and BigQuery Omni regions. You can use GoogleSQL JOIN operations to analyze data across many different storage solutions, such as AWS, Azure, public datasets, and other Google Cloud services. Cross-cloud joins eliminate the need to copy data across sources before running queries.

You can reference BigLake tables anywhere in a SELECT statement as if they were standard BigQuery tables, including in data manipulation language (DML) and data definition language (DDL) statements that use subqueries to retrieve data. You can use multiple BigLake tables from different clouds and BigQuery tables in the same query. All BigQuery tables must be from the same region.
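
As an illustrative sketch (table names are placeholders), a DML statement on a standard BigQuery table can retrieve data from a BigLake table through a subquery:

-- Hypothetical tables: daily_summary is a standard BigQuery table,
-- biglake_sales is a BigLake table referenced in the subquery.
INSERT INTO bigquery_dataset.daily_summary (sale_date, total_amount)
SELECT sale_date, SUM(amount) AS total_amount
FROM aws_dataset.biglake_sales
GROUP BY sale_date;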

Cross-cloud join required permissions

To get the permissions that you need to run a cross-cloud join, ask your administrator to grant you the following IAM roles on the project where the join is executed:

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to run a cross-cloud join. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to run a cross-cloud join:

  • bigquery.jobs.create
  • bigquery.tables.getData

You might also be able to get these permissions with custom roles or other predefined roles.

Cross-cloud join costs

When you run a cross-cloud join operation, BigQuery parses the query into local and remote parts. The local part is treated as a standard query in the BigQuery region. The remote part is converted into a CREATE TABLE AS SELECT (CTAS) operation on the referenced BigLake table in the BigQuery Omni region, which creates a temporary table in your BigQuery region. BigQuery then uses this temporary table to execute your cross-cloud join and deletes the table automatically after eight hours.

You incur data transfer costs for data in the referenced BigLake tables. However, BigQuery helps reduce these costs by only transferring columns and rows in the BigLake table that are referenced in the query, rather than the entire table. We recommend specifying a column filter that is as narrow as possible to further reduce transfer costs. The CTAS job appears in your job history and displays information such as the number of transferred bytes. Successful transfers incur costs even if the main query job fails. For more information, see BigQuery Omni pricing.

Consider the following query as an example:

SELECT *
FROM bigquery_dataset.bigquery_table AS clients
WHERE clients.sales_rep IN (
  SELECT id
  FROM aws_dataset.aws_table1 AS employees
  INNER JOIN aws_dataset.aws_table2 AS active_employees
    ON employees.id = active_employees.id
  WHERE employees.level > 3
);

This example has two transfers: one from an employees table (with a level filter) and one from an active employees table. The join is performed in the BigQuery region after the transfer occurs. If one transfer fails and the other succeeds, data transfer charges are still applied for the successful transfer.

Cross-cloud join limitations

  • Cross-cloud joins aren't supported in the BigQuery free tier and in the BigQuery sandbox.
  • Aggregations might not be pushed down to the BigQuery Omni regions if the query contains JOIN statements.
  • Each temporary table is only used for a single cross-cloud query and is not reused even if the same query is repeated multiple times.
  • The transfer size limit for each transfer is 60 GB. Specifically, if you apply a filter on a BigLake table and load the result, it must be smaller than 60 GB. If needed, you can request a quota adjustment. There is no limit on scanned bytes.
  • Cross-cloud join queries employ an internal quota on the rate of queries. If the rate of queries exceeds the quota, you might receive an All our servers are busy processing data transferred between regions error. Retrying the query should work in most cases. Contact support to increase the internal quota to support a higher rate of queries.
  • Cross-cloud joins are only supported in colocated BigQuery regions with their corresponding BigQuery Omni regions and in the US and EU multi-regions. Cross-cloud joins that are run in the US or EU multi-regions can only access data in US or EU BigQuery Omni regions, respectively.
  • If a cross-cloud join query references 10 or more datasets from BigQuery Omni regions, it might fail with an error Not found: Dataset <BigQuery dataset> was not found in location <BigQuery Omni region>. To avoid this issue, we recommend explicitly specifying a location when you run a cross-cloud join that references more than 10 datasets. Be aware that if you explicitly specify a BigQuery region and your query only contains BigLake tables, then your query is run as a cross-cloud query and incurs data transfer costs.
  • You can't query the _FILE_NAME pseudo-column with cross-cloud joins.
  • When you reference the columns of a BigLake table in a WHERE clause, you can't use INTERVAL or RANGE literals.
  • Cross-cloud join jobs don't report the number of bytes that are processed and transferred from other clouds. This information is available in the child CTAS jobs that are created as part of cross-cloud query execution.
  • Authorized views and authorized routines referencing BigQuery Omni tables or views are only supported in BigQuery Omni regions.
  • If your cross-cloud query references STRUCT or JSON columns, no pushdowns are applied to any remote subqueries. To optimize performance, consider creating a view in the BigQuery Omni region that filters STRUCT and JSON columns and returns only the necessary fields as individual columns.
  • Collation isn't supported by cross-cloud joins.

Cross-cloud join examples

The following query joins an orders table in a BigQuery region with a lineitem table in a BigQuery Omni region:

SELECT
  l_shipmode,
  o_orderpriority,
  count(l_linenumber) AS num_lineitems
FROM bigquery_dataset.orders
JOIN aws_dataset.lineitem
  ON orders.o_orderkey = lineitem.l_orderkey
WHERE
  l_shipmode IN ('AIR', 'REG AIR')
  AND l_commitdate < l_receiptdate
  AND l_shipdate < l_commitdate
  AND l_receiptdate >= DATE '1997-01-01'
  AND l_receiptdate < DATE '1997-02-01'
GROUP BY l_shipmode, o_orderpriority
ORDER BY l_shipmode, o_orderpriority;

This query is broken into local and remote parts. The following query is sent to the BigQuery Omni region to execute first. The result is a temporary table in the BigQuery region. You can view this child CTAS job and its metadata in your job history.

CREATE OR REPLACE TABLE temp_table AS (
  SELECT
    l_shipmode,
    l_linenumber,
    l_orderkey
  FROM aws_dataset.lineitem
  WHERE
    l_shipmode IN ('AIR', 'REG AIR')
    AND l_commitdate < l_receiptdate
    AND l_shipdate < l_commitdate
    AND l_receiptdate >= DATE '1997-01-01'
    AND l_receiptdate < DATE '1997-02-01'
);

After the temporary table is created, the JOIN operation completes, and the following query is run:

SELECT
  l_shipmode,
  o_orderpriority,
  count(l_linenumber) AS num_lineitems
FROM bigquery_dataset.orders
JOIN temp_table
  ON orders.o_orderkey = lineitem.l_orderkey
GROUP BY l_shipmode, o_orderpriority
ORDER BY l_shipmode, o_orderpriority;

As another example, consider the following cross-cloud join:

SELECT c_mktsegment, c_name
FROM bigquery_dataset.customer
WHERE c_mktsegment = 'BUILDING'
UNION ALL
SELECT c_mktsegment, c_name
FROM aws_dataset.customer
WHERE c_mktsegment = 'FURNITURE'
LIMIT 10;

In this query, the LIMIT clause is not pushed down to the BigQuery Omni region. All customers in the FURNITURE market segment are transferred to the BigQuery region first, and then the limit of 10 is applied.

Connectors

You can access data in BigLake tables based on Cloud Storage from other data processing tools by using BigQuery connectors. For example, you could access data in BigLake tables from Apache Spark, Apache Hive, TensorFlow, Trino, or Presto. The BigQuery Storage API enforces row- and column-level governance policies on all data access to BigLake tables, including through connectors.

For example, the following diagram demonstrates how the BigQuery Storage API lets users access authorized data using open source query engines such as Apache Spark:

BigLake architecture.

For more information about connectors supported by BigQuery, see BigQuery connectors.

BigLake tables on object stores

For data lake administrators, BigLake lets you set access controls on tables rather than files, which gives you finer-grained options when setting user access to data in the data lake.

Because BigLake tables simplify access control in this way, we recommend using BigLake tables to build and maintain connections to external object stores.

You can use external tables in cases where governance is not a requirement, or for ad hoc data discovery and manipulation.

Limitations

Security model

The following organizational roles are typically involved in managing and using BigLake tables:

  • Data lake administrators. These administrators typically manage Identity and Access Management (IAM) policies on Cloud Storage buckets and objects.
  • Data warehouse administrators. These administrators typically create, delete, and update tables.
  • Data analysts. Analysts typically read data and run queries.

Data lake administrators are responsible for creating connections and sharing them with data warehouse administrators. In turn, data warehouse administrators create tables, set appropriate access controls, and share the tables with data analysts.
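
For example, a data warehouse administrator might use BigQuery's GRANT statement to give analysts read access to a single BigLake table. The table and principal below are placeholders.

-- Hypothetical table and user; grants table-level read access only, with no
-- access to the underlying Cloud Storage objects.
GRANT `roles/bigquery.dataViewer`
ON TABLE `myproject.mydataset.sales_biglake`
TO "user:analyst@example.com";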

Caution: Data analysts should not have the following:

Metadata caching for performance

You can use cached metadata to improve query performance on some types of BigLake tables. Metadata caching is especially helpful in cases where you are working with large numbers of files or if the data is hive partitioned. The following types of BigLake tables support metadata caching:

  • Amazon S3 BigLake tables
  • Cloud Storage BigLake tables

BigQuery uses CMETA as a distributed metadata system to handle large tables efficiently. CMETA provides fine-grained metadata at the column and block level, accessible through system tables. This system helps improve query performance by optimizing data access and processing. To further accelerate query performance on large tables, BigQuery maintains a metadata cache. CMETA refresh jobs keep this cache up-to-date.

The metadata includes file names, partitioning information, and physical metadata from files such as row counts. You can choose whether or not to enable metadata caching on a table. Queries with a large number of files and with Apache Hive partition filters benefit the most from metadata caching.

If you don't enable metadata caching, queries on the table must read the external data source to get object metadata. Reading this data increases the query latency; listing millions of files from the external data source can take several minutes. If you enable metadata caching, queries can avoid listing files from the external data source and can partition and prune files more quickly.

Metadata caching also integrates with Cloud Storage object versioning. When the cache is populated or refreshed, it captures metadata based on the live version of the Cloud Storage objects at that time. As a result, queries that use the metadata cache read data corresponding to the cached object version, even if newer versions become live in Cloud Storage. To access data from object versions that were updated later, you must refresh the metadata cache.

There are two properties that control this feature:

  • Maximum staleness specifies when queries use cached metadata.
  • Metadata cache mode specifies how the metadata is collected.

When you have metadata caching enabled, you specify the maximum interval of metadata staleness that is acceptable for operations against the table. For example, if you specify an interval of 1 hour, then operations against the table use cached metadata if it has been refreshed within the past hour. If the cached metadata is older than that, the operation falls back to retrieving metadata from the data store (Amazon S3 or Cloud Storage) instead. You can specify a staleness interval between 30 minutes and 7 days.
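
As a sketch of how these two properties might be set together when creating a Cloud Storage BigLake table (all names and values below are placeholders):

-- Hypothetical names; max_staleness sets the acceptable staleness interval
-- and metadata_cache_mode sets how the cache is refreshed.
CREATE EXTERNAL TABLE mydataset.events_biglake
WITH CONNECTION `myproject.us.my_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/events/*.parquet'],
  max_staleness = INTERVAL 1 HOUR,
  metadata_cache_mode = 'AUTOMATIC'
);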

When you enable metadata caching for BigLake or object tables, BigQuery triggers metadata generation refresh jobs. You can choose to refresh the cache either automatically or manually:

  • For automatic refreshes, the cache is refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes. Refreshing the cache automatically is a good approach if the files in the data store are added, deleted, or modified at random intervals. If you need to control the timing of the refresh, for example to trigger the refresh at the end of an extract-transform-load job, use manual refresh.
  • For manual refreshes, you run the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the metadata cache on a schedule that meets your requirements (see the example after this list). For BigLake tables, you can refresh the metadata selectively by providing subdirectories of the table data directory. This lets you avoid unnecessary metadata processing. Refreshing the cache manually is a good approach if the files in the data store are added, deleted, or modified at known intervals, for example as the output of a pipeline.

    If you issue multiple concurrent manual refreshes, only one will succeed.
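
The following sketch shows what a manual refresh call might look like; the table path is a placeholder. (The procedure also accepts subdirectory arguments for selective refresh, as described above; check the procedure reference for the exact form.)

-- Refresh the cached metadata for a hypothetical BigLake table.
CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('myproject.mydataset.events_biglake');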

The metadata cache expires after 7 days if it isn't refreshed.

Both manual and automatic cache refreshes are executed with INTERACTIVE query priority.

Use BACKGROUND reservations

If you choose to use automatic refreshes, we recommend that you create a reservation, and then create an assignment with a BACKGROUND job type for the project that runs the metadata cache refresh jobs. With BACKGROUND reservations, refresh jobs use a dedicated resource pool, which prevents them from competing with user queries and from potentially failing if there aren't sufficient resources available for them.

While using a shared slot pool incurs no extra cost, using BACKGROUND reservations instead provides more consistent performance by allocating a dedicated resource pool, and improves the reliability of refresh jobs and overall query efficiency in BigQuery.
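
A hedged sketch of this setup using BigQuery's reservation DDL follows; the administration project, region, names, and slot capacity are placeholders, and your organization may manage reservations through the console or gcloud instead.

-- Hypothetical administration project, reservation, and workload project.
CREATE RESERVATION `admin-project.region-us.metadata-refresh`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 100);

-- Route metadata cache refresh jobs from the workload project to this
-- reservation by assigning the BACKGROUND job type.
CREATE ASSIGNMENT `admin-project.region-us.metadata-refresh.background-assignment`
OPTIONS (
  assignee = 'projects/workload-project',
  job_type = 'BACKGROUND');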

You should consider how the staleness interval and metadata caching mode values will interact before you set them. Consider the following examples:

  • If you are manually refreshing the metadata cache for a table, and you set the staleness interval to 2 days, you must run the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure every 2 days or less if you want operations against the table to use cached metadata.
  • If you are automatically refreshing the metadata cache for a table, and you set the staleness interval to 30 minutes, it is possible that some of your operations against the table might read from the data store if the metadata cache refresh takes on the longer side of the usual 30 to 60 minute window.

To find information about metadata refresh jobs, query the INFORMATION_SCHEMA.JOBS view, as shown in the following example:

SELECT *
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE
  job_id LIKE '%metadata_cache_refresh%'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
ORDER BY start_time DESC
LIMIT 10;

For Cloud Storage BigLake tables that are based on Parquet files, table statistics are collected during the metadata cache refresh and used to improve query plans.

To learn more, see Metadata caching.

For more information on setting metadata caching options, see Create Amazon S3 BigLake tables or Create Cloud Storage BigLake tables.

Cache-enabled tables with materialized views

You can use materialized views over BigLake metadata cache-enabled tables to improve performance and efficiency when querying structured data stored in Cloud Storage or Amazon Simple Storage Service (Amazon S3). These materialized views function like materialized views over BigQuery-managed storage tables, including the benefits of automatic refresh and smart tuning.
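
As a rough sketch (all names are placeholders, and additional option requirements may apply to views over BigLake tables), a materialized view over a cache-enabled BigLake table is created like a materialized view over any other table:

-- Hypothetical names; the base table is a metadata cache-enabled BigLake table.
CREATE MATERIALIZED VIEW mydataset.daily_sales_mv AS
SELECT sale_date, region, SUM(amount) AS total_amount
FROM mydataset.sales_biglake
GROUP BY sale_date, region;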

Integrations

BigLake tables are accessible from a number of other BigQuery features and gcloud CLI services, including the following highlighted services.

BigQuery sharing (formerly Analytics Hub)

BigLake tables are compatible with Sharing. Datasets containing BigLake tables can be published as Sharing listings. Sharing subscribers can subscribe to these listings, which provision a read-only dataset, called a linked dataset, in their project. Subscribers can query all tables in the linked dataset, including all BigLake tables. For more information, see View and subscribe to listings.

BigQuery ML

You can use BigQuery ML to train and run models on BigLake in Cloud Storage.
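
For example (the model, table, and column names below are hypothetical), a BigQuery ML model can be trained directly on data in a Cloud Storage BigLake table:

-- Hypothetical names; the training data comes from a BigLake table.
CREATE OR REPLACE MODEL mydataset.churn_model
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churned']) AS
SELECT churned, tenure_months, monthly_spend
FROM mydataset.customers_biglake;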

Sensitive Data Protection

Sensitive Data Protection scans your BigLake tables to identify and classify sensitive data. If sensitive data is detected, Sensitive Data Protection de-identification transformations can mask, delete, or otherwise obscure that data.

Costs

Costs are associated with the following aspects of BigLake tables:

If you have slot reservations, you are not charged for querying external tables. Instead, slots are consumed for these queries.

The following list shows how your pricing model affects how these costs are applied:


  • Queries: With on-demand pricing, you are billed for the bytes processed by user queries. With the Standard, Enterprise, and Enterprise Plus editions, slots in reservation assignments with a QUERY job type are consumed during query time.
  • Manually refreshing the metadata cache: With on-demand pricing, you are billed for the bytes processed to refresh the cache. With the Standard, Enterprise, and Enterprise Plus editions, slots in reservation assignments with a QUERY job type are consumed during cache refresh.
  • Automatically refreshing the metadata cache: With on-demand pricing, you are billed for the bytes processed to refresh the cache. With the Standard, Enterprise, and Enterprise Plus editions, slots in reservation assignments with a BACKGROUND job type are consumed during cache refresh.

If there are no BACKGROUND reservations available for refreshing the metadata cache, BigQuery automatically uses slots in QUERY reservations instead if you are using the Enterprise or Enterprise Plus edition.

You are also charged for storage and data access by Cloud Storage, Amazon S3, and Azure Blob Storage, subject to each product's pricing guidelines.

When BigQuery interacts with Cloud Storage, you mightincur the following Cloud Storage costs:

  • Data storage costs for the amount of data stored.
  • Data retrieval costs for accessing data in Nearline, Coldline, and Archive storage classes. Take caution when querying tables or refreshing the metadata cache against these storage classes, as charges can be significant.
  • Network usage costs for data that you read across different regions, suchas when your BigQuery dataset and Cloud Storagebucket are in different regions.
  • Data processing charges. However, you aren't charged for API calls that aremade by BigQuery on your behalf, such as listing or gettingresources.

What's next
