Create BigLake external tables for Cloud Storage

Important: The term "BigLake" on this page refers to an access delegation functionality for external tables in BigQuery. For information about BigLake, the stand-alone Google Cloud product that includes BigLake metastore, the Apache Iceberg REST catalog, and BigLake tables for Apache Iceberg, see BigLake overview.

This document describes how to create a Cloud Storage BigLake table. A BigLake table lets you use access delegation to query structured data in Cloud Storage. Access delegation decouples access to the BigLake table from access to the underlying data store.

Before you begin

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to project selector

  2. Verify that billing is enabled for your Google Cloud project.

  3. Enable the BigQuery Connection API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API

    If you want to read BigLake tables from open source engines such as Apache Spark, then you need to enable the BigQuery Storage Read API.
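
    For example, you can enable that API with the gcloud CLI. This is a minimal sketch that assumes the gcloud CLI is installed and a default project is set:

    # Enable the BigQuery Storage Read API for the current project.
    gcloud services enable bigquerystorage.googleapis.com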

  4. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

  5. Ensure that you have a BigQuery dataset.

  6. Ensure that your version of the Google Cloud SDK is 366.0.0 or later:

    gcloud version

    If needed, update the Google Cloud SDK.
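
    If you manage the SDK yourself rather than using Cloud Shell, the standard update command is a reasonable sketch:

    # Update a locally installed Google Cloud SDK to the latest release.
    gcloud components update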

    1. Optional: For Terraform, terraform-provider-google version 4.25.0 or later is required. terraform-provider-google releases are listed on GitHub. You can download the latest version of Terraform from HashiCorp Terraform downloads.
  7. Create a Cloud resource connection or set up a default connection to your external data source. Connections require additional roles and permissions. For more information, see Create a Cloud resource connection and the Default connection overview.
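
    As a rough sketch, you can create a Cloud resource connection with the bq tool. The project, location, and connection ID below are placeholders; see Create a Cloud resource connection for the full steps, including granting the connection's service account access to your bucket.

    # Create a Cloud resource connection (placeholder names).
    bq mk --connection --location=us --project_id=myproject \
        --connection_type=CLOUD_RESOURCE myconnection

    # Show the connection details, including its service account ID.
    bq show --connection myproject.us.myconnection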

Required roles

To create a BigLake table, you need the following BigQuery Identity and Access Management (IAM) permissions:

  • bigquery.tables.create
  • bigquery.connections.delegate

The BigQuery Admin (roles/bigquery.admin) predefined Identity and Access Management role includes these permissions.

If you are not a principal in this role, ask your administrator to grant you access or to create the BigLake table for you.

For more information on Identity and Access Management roles and permissions in BigQuery, see Predefined roles and permissions.
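
For example, an administrator could grant that role at the project level with the gcloud CLI. The project and user values below are placeholders, and a narrower role that still includes the two permissions above would also work:

# Grant the BigQuery Admin role to a user (placeholder values).
gcloud projects add-iam-policy-binding myproject \
    --member="user:kim@example.com" \
    --role="roles/bigquery.admin"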

Location consideration

When you use Cloud Storage to store data files, you can improve performance by using Cloud Storage single-region or dual-region buckets instead of multi-region buckets.

Create BigLake tables on unpartitioned data

If you're familiar with creating tables in BigQuery, the process of creating a BigLake table is similar. Your table can use any file format that BigLake supports. For more information, see Limitations.

Before you create a BigLake table, you need to have a dataset and a Cloud resource connection that can access Cloud Storage.
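
If you still need a dataset, the following sketch creates one with the bq tool (placeholder project and dataset names; the dataset location must match the location of the Cloud Storage data you plan to query):

# Create a dataset in the US multi-region (placeholder names).
bq --location=US mk --dataset myproject:mydataset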

To create a BigLake table, select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the left pane, click Explorer.

    If you don't see the left pane, click Expand left pane to open the pane.

  3. In the Explorer pane, expand your project, click Datasets, and then select a dataset.

  4. Expand the Actions option and click Create table.

  5. In the Source section, specify the following details:

    1. For Create table from, select Google Cloud Storage.

    2. For Select file from GCS bucket or use a URI pattern, browse to select a bucket and file to use, or type the path in the format gs://bucket_name/[folder_name/]file_name.

      You can't specify multiple URIs in the Google Cloud console, but you can select multiple files by specifying one asterisk (*) wildcard character. For example, gs://mybucket/file_name*. For more information, see Wildcard support for Cloud Storage URIs.

      The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.

    3. For File format, select the format that matches your file.

  6. In the Destination section, specify the following details:

    1. For Project, choose the project in which to create the table.

    2. For Dataset, choose the dataset in which to create the table.

    3. For Table, enter the name of the table you are creating.

    4. For Table type, select External table.

    5. Select Create a BigLake table using a Cloud Resource connection.

    6. For Connection ID, select the connection that you created earlier.

  7. In the Schema section, you can either enable schema auto-detection or manually specify a schema if you have a source file. If you don't have a source file, you must manually specify a schema.

    • To enable schema auto-detection, select the Auto-detect option.

    • To manually specify a schema, leave the Auto-detect option unchecked. Enable Edit as text and enter the table schema as a JSON array (see the example after these steps).

  8. To ignore rows with extra column values that do not match the schema, expand the Advanced options section and select Unknown values.

  9. Click Create table.
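
If you enter the schema as text, it is a JSON array of field definitions. The following is a minimal sketch with placeholder field names that match the examples used later on this page:

[
  {"name": "Region", "type": "STRING", "mode": "NULLABLE"},
  {"name": "Quarter", "type": "STRING", "mode": "NULLABLE"},
  {"name": "Total_sales", "type": "INTEGER", "mode": "NULLABLE"}
]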

After the permanent table is created, you can run a query against the table as if it were a native BigQuery table. After your query completes, you can export the results as CSV or JSON files, save the results as a table, or save the results to Google Sheets.
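
For example, a query against the new table is written the same way as a query against a native table. The dataset, table, and column names below are the placeholder values used elsewhere on this page:

SELECT Region, SUM(Total_sales) AS total_sales
FROM mydataset.mytable
GROUP BY Region;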

SQL

Use the CREATE EXTERNAL TABLE DDL statement. You can specify the schema explicitly, or use schema auto-detection to infer the schema from the external data.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
      WITH CONNECTION {`PROJECT_ID.REGION.CONNECTION_ID` | DEFAULT}
      OPTIONS (
        format = "TABLE_FORMAT",
        uris = ['BUCKET_PATH'[,...]],
        max_staleness = STALENESS_INTERVAL,
        metadata_cache_mode = 'CACHE_MODE');

    Replace the following:

    • PROJECT_ID: the name of your project in which you want to create the table—for example, myproject
    • DATASET: the name of the BigQuery dataset that you want to create the table in—for example, mydataset
    • EXTERNAL_TABLE_NAME: the name of the table that you want to create—for example, mytable
    • REGION: the region that contains the connection—for example, us
    • CONNECTION_ID: the connection ID—for example, myconnection

      When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that's shown in Connection ID—for example, projects/myproject/locations/connection_location/connections/myconnection.

      To use a default connection, specify DEFAULT instead of the connection string containing PROJECT_ID.REGION.CONNECTION_ID.

    • TABLE_FORMAT: the format of the table that you want to create—for example, PARQUET

      For more information about supported formats, see Limitations.

    • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format ['gs://bucket_name/[folder_name/]file_name'].

      You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the path. For example, ['gs://mybucket/file_name*']. For more information, see Wildcard support for Cloud Storage URIs.

      You can specify multiple buckets for the uris option by providing multiple paths.

      The following examples show valid uris values:

      • ['gs://bucket/path1/myfile.csv']
      • ['gs://bucket/path1/*.csv']
      • ['gs://bucket/path1/*', 'gs://bucket/path2/file00*']

      When you specify uris values that target multiple files, all of those files must share a compatible schema.

      For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.

    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information on metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.
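
For example, the following statement is a sketch that follows the template above, using the placeholder project, dataset, connection, and bucket names from this page to create a table over Parquet files with a four-hour staleness interval and automatic metadata refresh:

CREATE EXTERNAL TABLE `myproject.mydataset.mytable`
  WITH CONNECTION `myproject.us.myconnection`
  OPTIONS (
    format = "PARQUET",
    uris = ['gs://mybucket/*.parquet'],
    max_staleness = INTERVAL 4 HOUR,
    metadata_cache_mode = 'AUTOMATIC');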

bq

Option 1: Table definition file

Use the bq mkdef command to create a table definition file, and then pass the path to the bq mk command as follows:

bq mkdef \
    --connection_id=CONNECTION_ID \
    --source_format=SOURCE_FORMAT \
    BUCKET_PATH > DEFINITION_FILE

bq mk --table \
    --external_table_definition=DEFINITION_FILE \
    --max_staleness=STALENESS_INTERVAL \
    PROJECT_ID:DATASET.EXTERNAL_TABLE_NAME \
    SCHEMA

Replace the following:

  • CONNECTION_ID: the connection ID—for example, myconnection

    When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example, projects/myproject/locations/connection_location/connections/myconnection.

    To use a default connection, specify DEFAULT instead of the connection string containing PROJECT_ID.REGION.CONNECTION_ID.

  • SOURCE_FORMAT: the format of the external data source. For example, PARQUET.

  • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the table, in the format gs://bucket_name/[folder_name/]file_pattern.

    You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the file_pattern. For example, gs://mybucket/file00*.parquet. For more information, see Wildcard support for Cloud Storage URIs.

    You can specify multiple buckets for the uris option by providing multiple paths.

    The following examples show valid uris values:

    • gs://bucket/path1/myfile.csv
    • gs://bucket/path1/*.parquet
    • gs://bucket/path1/file1*, gs://bucket1/path1/*

    When you specify uris values that target multiple files, all of those files must share a compatible schema.

    For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

  • DEFINITION_FILE: the path to the table definition file on your local machine.

  • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

    To disable metadata caching, specify 0. This is the default.

    To enable metadata caching, specify an interval value between 30 minutes and 7 days, using the Y-M D H:M:S format described in the INTERVAL data type documentation. For example, specify 0-0 0 4:0:0 for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.

  • DATASET: the name of the BigQuery dataset that you want to create a table in—for example, mydataset

  • EXTERNAL_TABLE_NAME: the name of the table that you want to create—for example, mytable

  • SCHEMA: the schema for the BigLake table

Example:

bq mkdef --connection_id=myconnection --metadata_cache_mode=CACHE_MODE --source_format=CSV 'gs://mybucket/*.csv' > mytable_def

bq mk --table --external_table_definition=mytable_def='gs://mybucket/*.csv' --max_staleness='0-0 0 4:0:0' myproject:mydataset.mybiglaketable Region:STRING,Quarter:STRING,Total_sales:INTEGER

To use schema auto-detection, set the --autodetect=true flag in the mkdef command and omit the schema:

bq mkdef \
    --connection_id=myconnection \
    --metadata_cache_mode=CACHE_MODE \
    --source_format=CSV --autodetect=true \
    gs://mybucket/*.csv > mytable_def

bq mk \
    --table \
    --external_table_definition=mytable_def=gs://mybucket/*.csv \
    --max_staleness='0-0 0 4:0:0' \
    myproject:mydataset.myexternaltable

Option 2: Inline table definition

Instead of creating a table definition file, you can pass the table definition directly to the bq mk command. Use the @connection decorator to specify the connection to use at the end of the --external_table_definition flag.

bq mk --table \
    --external_table_definition=@SOURCE_FORMAT=BUCKET_PATH@projects/PROJECT_ID/locations/REGION/connections/CONNECTION_ID \
    DATASET_NAME.TABLE_NAME \
    SCHEMA

Replace the following:

  • SOURCE_FORMAT: the format of the external data source

    For example, CSV.

  • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the table, in the format gs://bucket_name/[folder_name/]file_pattern.

    You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the file_pattern. For example, gs://mybucket/file00*.parquet. For more information, see Wildcard support for Cloud Storage URIs.

    You can specify multiple buckets for the uris option by providing multiple paths.

    The following examples show valid uris values:

    • gs://bucket/path1/myfile.csv
    • gs://bucket/path1/*.parquet
    • gs://bucket/path1/file1*, gs://bucket1/path1/*

    When you specify uris values that target multiple files, all of those files must share a compatible schema.

    For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

  • PROJECT_ID: the name of your project in which you want to create the table—for example, myproject

  • REGION: the region that contains the connection—for example, us

  • CONNECTION_ID: the connection ID—for example, myconnection

    When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example, projects/myproject/locations/connection_location/connections/myconnection.

    To use a default connection, specify DEFAULT instead of the connection string containing PROJECT_ID.REGION.CONNECTION_ID.

  • DATASET_NAME: the name of the dataset where you want to create the BigLake table

  • TABLE_NAME: the name of the BigLake table

  • SCHEMA: the schema for the BigLake table

Example:

bq mk --table \
    --external_table_definition=@CSV=gs://mybucket/*.csv@projects/myproject/locations/us/connections/myconnection \
    --max_staleness='0-0 0 4:0:0' \
    myproject:mydataset.myexternaltable \
    Region:STRING,Quarter:STRING,Total_sales:INTEGER

API

Call the tables.insert API method, and create an ExternalDataConfiguration in the Table resource that you pass in.

Specify the schema property or set the autodetect property to true to enable schema auto-detection for supported data sources.

Specify the connectionId property to identify the connection to use for connecting to Cloud Storage.
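
As a rough sketch, a tables.insert request body might look like the following. The field names follow the Table resource's JSON representation; treat the values as placeholders and check the ExternalDataConfiguration reference before relying on them.

{
  "tableReference": {
    "projectId": "myproject",
    "datasetId": "mydataset",
    "tableId": "mytable"
  },
  "externalDataConfiguration": {
    "sourceFormat": "PARQUET",
    "sourceUris": ["gs://mybucket/*.parquet"],
    "autodetect": true,
    "connectionId": "myproject.us.myconnection"
  }
}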

Terraform

This example creates a BigLake Table on unpartitioned data.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

# This creates a bucket in the US region named "my-bucket" with a pseudorandom suffix.
resource "random_id" "default" {
  byte_length = 8
}

resource "google_storage_bucket" "default" {
  name                        = "my-bucket-${random_id.default.hex}"
  location                    = "US"
  force_destroy               = true
  uniform_bucket_level_access = true
}

# This queries the provider for project information.
data "google_project" "project" {}

# This creates a connection in the US region named "my-connection".
# This connection is used to access the bucket.
resource "google_bigquery_connection" "default" {
  connection_id = "my-connection"
  location      = "US"
  cloud_resource {}
}

# This grants the previous connection IAM role access to the bucket.
resource "google_project_iam_member" "default" {
  role    = "roles/storage.objectViewer"
  project = data.google_project.project.id
  member  = "serviceAccount:${google_bigquery_connection.default.cloud_resource[0].service_account_id}"
}

# This makes the script wait for seven minutes before proceeding.
# This lets IAM permissions propagate.
resource "time_sleep" "default" {
  create_duration = "7m"

  depends_on = [google_project_iam_member.default]
}

# This defines a Google BigQuery dataset with
# default expiration times for partitions and tables, a
# description, a location, and a maximum time travel.
resource "google_bigquery_dataset" "default" {
  dataset_id                      = "my_dataset"
  default_partition_expiration_ms = 2592000000  # 30 days
  default_table_expiration_ms     = 31536000000 # 365 days
  description                     = "My dataset description"
  location                        = "US"
  max_time_travel_hours           = 96 # 4 days

  # This defines a map of labels for the bucket resource,
  # including the labels "billing_group" and "pii".
  labels = {
    billing_group = "accounting",
    pii           = "sensitive"
  }
}

# This creates a BigQuery Table with automatic metadata caching.
resource "google_bigquery_table" "default" {
  dataset_id = google_bigquery_dataset.default.dataset_id
  table_id   = "my_table"
  schema = jsonencode([
    { "name" : "country", "type" : "STRING" },
    { "name" : "product", "type" : "STRING" },
    { "name" : "price", "type" : "INT64" }
  ])
  external_data_configuration {
    # This defines an external data configuration for the BigQuery table
    # that reads Parquet data from the publish directory of the default
    # Google Cloud Storage bucket.
    autodetect    = false
    source_format = "PARQUET"
    connection_id = google_bigquery_connection.default.name
    source_uris   = ["gs://${google_storage_bucket.default.name}/data/*"]

    # This enables automatic metadata refresh.
    metadata_cache_mode = "AUTOMATIC"
  }

  # This sets the maximum staleness of the metadata cache to 10 hours.
  max_staleness = "0-0 0 10:0:0"

  deletion_protection = false

  depends_on = [time_sleep.default]
}

To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.

Prepare Cloud Shell

  1. Launch Cloud Shell.
  2. Set the default Google Cloud project where you want to apply your Terraform configurations.

    You only need to run this command once per project, and you can run it in any directory.

    export GOOGLE_CLOUD_PROJECT=PROJECT_ID

    Environment variables are overridden if you set explicit values in the Terraform configuration file.

Prepare the directory

Each Terraform configuration file must have its own directory (also called a root module).

  1. In Cloud Shell, create a directory and a new file within that directory. The filename must have the .tf extension—for example, main.tf. In this tutorial, the file is referred to as main.tf.
    mkdir DIRECTORY && cd DIRECTORY && touch main.tf
  2. If you are following a tutorial, you can copy the sample code in each section or step.

    Copy the sample code into the newly created main.tf.

    Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution.

  3. Review and modify the sample parameters to apply to your environment.
  4. Save your changes.
  5. Initialize Terraform. You only need to do this once per directory.
    terraform init

    Optionally, to use the latest Google provider version, include the -upgrade option:

    terraform init -upgrade

Apply the changes

  1. Review the configuration and verify that the resources that Terraform is going to create or update match your expectations:
    terraform plan

    Make corrections to the configuration as necessary.

  2. Apply the Terraform configuration by running the following command and entering yes at the prompt:
    terraform apply

    Wait until Terraform displays the "Apply complete!" message.

  3. Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.
Note: Terraform samples typically assume that the required APIs are enabled in your Google Cloud project.

BigLake supports schema autodetection. However, if you did not provide a schema and the service account was not granted access in the previous steps, these steps fail with an access denied message if you try to autodetect the schema.

Create BigLake tables on Apache Hive partitioned data

You can create a BigLake table for Hive partitioned data in Cloud Storage. After you create an externally partitioned table, you can't change the partition key. You need to recreate the table to change the partition key.

To create a BigLake table based on Hive partitioned data in Cloud Storage, select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the left pane, click Explorer.

    If you don't see the left pane, click Expand left pane to open the pane.

  3. In the Explorer pane, expand your project, click Datasets, and then click a dataset.

  4. Click Create table. The Create table pane opens.

  5. In the Source section, specify the following details:

    1. For Create table from, select Google Cloud Storage.

    2. Provide the path to the folder, using wildcards. For example, my_bucket/my_files*. The folder must be in the same location as the dataset that contains the table you want to create, append, or overwrite.

    3. From the File format list, select the file type.

    4. Select the Source data partitioning checkbox, and then specify the following details:

      1. For Select Source URI Prefix, enter the URI prefix. For example, gs://my_bucket/my_files.
      2. Optional: To require a partition filter on all queries for this table, select the Require partition filter checkbox. Requiring a partition filter can reduce cost and improve performance. For more information, see Requiring predicate filters on partition keys in queries.
      3. In the Partition inference mode section, select one of the following options:

        • Automatically infer types: set the partition schema detection mode to AUTO.
        • All columns are strings: set the partition schema detection mode to STRINGS.
        • Provide my own: set the partition schema detection mode to CUSTOM and manually enter the schema information for the partition keys. For more information, see Provide a custom partition key schema.
  6. In the Destination section, specify the following details:

    1. For Project, select the project in which you want to create the table.
    2. For Dataset, select the dataset in which you want to create the table.
    3. For Table, enter the name of the table that you want to create.
    4. For Table type, select External table.
    5. Select the Create a BigLake table using a Cloud Resource connection checkbox.
    6. For Connection ID, select the connection that you created earlier.
  7. In the Schema section, enable schema auto-detection by selecting the Auto detect option.

  8. To ignore rows with extra column values that don't match the schema, expand the Advanced options section and select Unknown values.

  9. Click Create table.

SQL

Use the CREATE EXTERNAL TABLE DDL statement:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
      WITH PARTITION COLUMNS
      (
        PARTITION_COLUMN PARTITION_COLUMN_TYPE,
      )
      WITH CONNECTION {`PROJECT_ID.REGION.CONNECTION_ID` | DEFAULT}
      OPTIONS (
        hive_partition_uri_prefix = "HIVE_PARTITION_URI_PREFIX",
        uris = ['FILE_PATH'],
        max_staleness = STALENESS_INTERVAL,
        metadata_cache_mode = 'CACHE_MODE',
        format = "TABLE_FORMAT");

    Replace the following:

    • PROJECT_ID: the name of your project in which you want to create the table—for example, myproject
    • DATASET: the name of the BigQuery dataset that you want to create the table in—for example, mydataset
    • EXTERNAL_TABLE_NAME: the name of the table that you want to create—for example, mytable
    • PARTITION_COLUMN: the name of the partitioning column
    • PARTITION_COLUMN_TYPE: the type of the partitioning column
    • REGION: the region that contains the connection—for example, us
    • CONNECTION_ID: the connection ID—for example, myconnection

      When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example, projects/myproject/locations/connection_location/connections/myconnection.

      To use a default connection, specify DEFAULT instead of the connection string containing PROJECT_ID.REGION.CONNECTION_ID.

    • HIVE_PARTITION_URI_PREFIX: the Hive partitioning URI prefix—for example, gs://mybucket/
    • FILE_PATH: path to the data source for the external table that you want to create—for example, gs://mybucket/*.parquet
    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.

    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

    • TABLE_FORMAT: the format of the table that you want to create—for example, PARQUET

  3. Click Run.

For more information about how to run queries, see Run an interactive query.

Examples

The following example creates a BigLake table over partitioned data where:

  • The schema is autodetected.
  • The metadata cache staleness interval for the table is 1 day.
  • The metadata cache refreshes automatically.
CREATE EXTERNAL TABLE `my_dataset.my_table`
  WITH PARTITION COLUMNS
  (
    sku STRING,
  )
  WITH CONNECTION `us.my-connection`
  OPTIONS (
    hive_partition_uri_prefix = "gs://mybucket/products",
    uris = ['gs://mybucket/products/*'],
    max_staleness = INTERVAL 1 DAY,
    metadata_cache_mode = 'AUTOMATIC');

The following example creates a BigLake table over partitioned data where:

  • The schema is specified.
  • The metadata cache staleness interval for the table is 8 hours.
  • The metadata cache must be manually refreshed.
CREATE EXTERNAL TABLE `my_dataset.my_table`
(
  ProductId INTEGER,
  ProductName STRING,
  ProductType STRING
)
WITH PARTITION COLUMNS
(
  sku STRING,
)
WITH CONNECTION `us.my-connection`
OPTIONS (
  hive_partition_uri_prefix = "gs://mybucket/products",
  uris = ['gs://mybucket/products/*'],
  max_staleness = INTERVAL 8 HOUR,
  metadata_cache_mode = 'MANUAL');

bq

First, use the bq mkdef command to create a table definition file:

bq mkdef \
    --source_format=SOURCE_FORMAT \
    --connection_id=REGION.CONNECTION_ID \
    --hive_partitioning_mode=PARTITIONING_MODE \
    --hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX \
    --require_hive_partition_filter=BOOLEAN \
    --metadata_cache_mode=CACHE_MODE \
    GCS_URIS > DEFINITION_FILE

Replace the following:

  • SOURCE_FORMAT: the format of the external data source. For example, CSV.
  • REGION: the region that contains the connection—for example, us.
  • CONNECTION_ID: the connection ID—for example, myconnection.

    When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example, projects/myproject/locations/connection_location/connections/myconnection.

    To use a default connection, specify DEFAULT instead of the connection string containing PROJECT_ID.REGION.CONNECTION_ID.

  • PARTITIONING_MODE: the Hive partitioning mode. Use one of the following values:

    • AUTO: Automatically detect the key names and types.
    • STRINGS: Automatically convert the key names to strings.
    • CUSTOM: Encode the key schema in the source URI prefix.
  • GCS_URI_SHARED_PREFIX: the source URI prefix.

  • BOOLEAN: specifies whether to require a predicate filter at query time. This flag is optional. The default value is false.

  • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. You only need to include this flag if you also plan to use the --max_staleness flag in the subsequent bq mk command to enable metadata caching. For more information on metadata caching considerations, see Metadata caching for performance.

    Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

    Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

    You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

  • GCS_URIS: the path to the Cloud Storage folder, using wildcard format.

  • DEFINITION_FILE: the path to the table definition file on your local machine.

If PARTITIONING_MODE is CUSTOM, include the partition key schema in the source URI prefix, using the following format:

--hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX/{KEY1:TYPE1}/{KEY2:TYPE2}/...

After you create the table definition file, use the bq mk command to create the BigLake table:

bq mk --external_table_definition=DEFINITION_FILE \
    --max_staleness=STALENESS_INTERVAL \
    DATASET_NAME.TABLE_NAME \
    SCHEMA

Replace the following:

  • DEFINITION_FILE: the path to the table definition file.
  • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. If you include this flag, you must have also specified a value for the --metadata_cache_mode flag in the preceding bq mkdef command. For more information on metadata caching considerations, see Metadata caching for performance.

    To disable metadata caching, specify 0. This is the default.

    To enable metadata caching, specify an interval value between 30 minutes and 7 days, using the Y-M D H:M:S format described in the INTERVAL data type documentation. For example, specify 0-0 0 4:0:0 for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.

  • DATASET_NAME: the name of the dataset that contains the table.

  • TABLE_NAME: the name of the table you're creating.

  • SCHEMA: specifies a path to a JSON schema file, or specifies the schema in the form field:data_type,field:data_type,.... To use schema auto-detection, omit this argument.

Examples

The following example uses AUTO Hive partitioning mode, and also sets the metadata cache to have a 12 hour staleness interval and to get refreshed automatically:

bq mkdef --source_format=CSV \
  --connection_id=us.my-connection \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://myBucket/myTable \
  --metadata_cache_mode=AUTOMATIC \
  gs://myBucket/myTable/* > mytable_def

bq mk --external_table_definition=mytable_def \
  --max_staleness='0-0 0 12:0:0' \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

The following example uses STRINGS Hive partitioning mode:

bq mkdef --source_format=CSV \
  --connection_id=us.my-connection \
  --hive_partitioning_mode=STRINGS \
  --hive_partitioning_source_uri_prefix=gs://myBucket/myTable \
  gs://myBucket/myTable/* > mytable_def

bq mk --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

The following example uses CUSTOM Hive partitioning mode:

bq mkdef --source_format=CSV \
  --connection_id=us.my-connection \
  --hive_partitioning_mode=CUSTOM \
  --hive_partitioning_source_uri_prefix=gs://myBucket/myTable/{dt:DATE}/{val:STRING} \
  gs://myBucket/myTable/* > mytable_def

bq mk --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

API

To set Hive partitioning using the BigQuery API, include the hivePartitioningOptions object in the ExternalDataConfiguration object when you create the table definition file. To create a BigLake table, you must also specify a value for the connectionId field.

If you set the hivePartitioningOptions.mode field to CUSTOM, you must encode the partition key schema in the hivePartitioningOptions.sourceUriPrefix field as follows: gs://BUCKET/PATH_TO_TABLE/{KEY1:TYPE1}/{KEY2:TYPE2}/...

To enforce the use of a predicate filter at query time, set the hivePartitioningOptions.requirePartitionFilter field to true.
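
As a sketch, the relevant fragment of an externalDataConfiguration with Hive partitioning might look like the following; the bucket path, connection, and key names are placeholders.

{
  "sourceFormat": "PARQUET",
  "sourceUris": ["gs://mybucket/products/*"],
  "connectionId": "myproject.us.myconnection",
  "hivePartitioningOptions": {
    "mode": "CUSTOM",
    "sourceUriPrefix": "gs://mybucket/products/{dt:DATE}/{sku:STRING}",
    "requirePartitionFilter": true
  }
}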

Terraform

This example creates a BigLake Table on partitioned data.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

# This creates a bucket in the US region named "my-bucket" with a pseudorandom
# suffix.
resource "random_id" "default" {
  byte_length = 8
}

resource "google_storage_bucket" "default" {
  name                        = "my-bucket-${random_id.default.hex}"
  location                    = "US"
  force_destroy               = true
  uniform_bucket_level_access = true
}

resource "google_storage_bucket_object" "default" {
  # This creates a fake message to create partition locations on the table.
  # Otherwise, the table deployment fails.
  name    = "publish/dt=2000-01-01/hr=00/min=00/fake_message.json"
  content = "{\"column1\": \"XXX\"}"
  bucket  = google_storage_bucket.default.name
}

# This queries the provider for project information.
data "google_project" "default" {}

# This creates a connection in the US region named "my-connection".
# This connection is used to access the bucket.
resource "google_bigquery_connection" "default" {
  connection_id = "my-connection"
  location      = "US"
  cloud_resource {}
}

# This grants the previous connection IAM role access to the bucket.
resource "google_project_iam_member" "default" {
  role    = "roles/storage.objectViewer"
  project = data.google_project.default.id
  member  = "serviceAccount:${google_bigquery_connection.default.cloud_resource[0].service_account_id}"
}

# This makes the script wait for seven minutes before proceeding. This lets IAM
# permissions propagate.
resource "time_sleep" "default" {
  create_duration = "7m"

  depends_on = [google_project_iam_member.default]
}

# This defines a Google BigQuery dataset with default expiration times for
# partitions and tables, a description, a location, and a maximum time travel.
resource "google_bigquery_dataset" "default" {
  dataset_id                      = "my_dataset"
  default_partition_expiration_ms = 2592000000  # 30 days
  default_table_expiration_ms     = 31536000000 # 365 days
  description                     = "My dataset description"
  location                        = "US"
  max_time_travel_hours           = 96 # 4 days

  # This defines a map of labels for the bucket resource,
  # including the labels "billing_group" and "pii".
  labels = {
    billing_group = "accounting",
    pii           = "sensitive"
  }
}

# This creates a BigQuery table with partitioning and automatic metadata
# caching.
resource "google_bigquery_table" "default" {
  dataset_id = google_bigquery_dataset.default.dataset_id
  table_id   = "my_table"
  schema     = jsonencode([{ "name" : "column1", "type" : "STRING", "mode" : "NULLABLE" }])
  external_data_configuration {
    # This defines an external data configuration for the BigQuery table
    # that reads Parquet data from the publish directory of the default
    # Google Cloud Storage bucket.
    autodetect    = false
    source_format = "PARQUET"
    connection_id = google_bigquery_connection.default.name
    source_uris   = ["gs://${google_storage_bucket.default.name}/publish/*"]

    # This configures Hive partitioning for the BigQuery table,
    # partitioning the data by date and time.
    hive_partitioning_options {
      mode                     = "CUSTOM"
      source_uri_prefix        = "gs://${google_storage_bucket.default.name}/publish/{dt:STRING}/{hr:STRING}/{min:STRING}"
      require_partition_filter = false
    }

    # This enables automatic metadata refresh.
    metadata_cache_mode = "AUTOMATIC"
  }

  # This sets the maximum staleness of the metadata cache to 10 hours.
  max_staleness = "0-0 0 10:0:0"

  deletion_protection = false

  depends_on = [
    time_sleep.default,
    google_storage_bucket_object.default
  ]
}

To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.

Prepare Cloud Shell

  1. Launch Cloud Shell.
  2. Set the default Google Cloud project where you want to apply your Terraform configurations.

    You only need to run this command once per project, and you can run it in any directory.

    export GOOGLE_CLOUD_PROJECT=PROJECT_ID

    Environment variables are overridden if you set explicit values in the Terraform configuration file.

Prepare the directory

Each Terraform configuration file must have its own directory (also called a root module).

  1. In Cloud Shell, create a directory and a new file within that directory. The filename must have the .tf extension—for example, main.tf. In this tutorial, the file is referred to as main.tf.
    mkdir DIRECTORY && cd DIRECTORY && touch main.tf
  2. If you are following a tutorial, you can copy the sample code in each section or step.

    Copy the sample code into the newly created main.tf.

    Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution.

  3. Review and modify the sample parameters to apply to your environment.
  4. Save your changes.
  5. Initialize Terraform. You only need to do this once per directory.
    terraform init

    Optionally, to use the latest Google provider version, include the -upgrade option:

    terraform init -upgrade

Apply the changes

  1. Review the configuration and verify that the resources that Terraform is going to create or update match your expectations:
    terraform plan

    Make corrections to the configuration as necessary.

  2. Apply the Terraform configuration by running the following command and entering yes at the prompt:
    terraform apply

    Wait until Terraform displays the "Apply complete!" message.

  3. Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.
Note: Terraform samples typically assume that the required APIs are enabled in your Google Cloud project.

Set up access control policies

You can use several methods to control access to BigLake tables, including column-level security, row-level security, and dynamic data masking.

For example, let's say you want to limit row access for the table mytable in the dataset mydataset:

+---------+---------+-------+
| country | product | price |
+---------+---------+-------+
| US      | phone   |   100 |
| JP      | tablet  |   300 |
| UK      | laptop  |   200 |
+---------+---------+-------+

You can create a row-level filter for Kim (kim@example.com) that restricts their access to rows where country is equal to US.

CREATE ROW ACCESS POLICY only_us_filter
ON mydataset.mytable
GRANT TO ('user:kim@example.com')
FILTER USING (country = 'US');

Then, Kim runs the following query:

SELECT * FROM projectid.mydataset.mytable;

The output shows only the rows where country is equal to US:

+---------+---------+-------+
| country | product | price |
+---------+---------+-------+
| US      | phone   |   100 |
+---------+---------+-------+

Query BigLake tables

For more information, see Query Cloud Storage data in BigLake tables.

Update BigLake tables

You can update BigLake tables if necessary, for example to change their metadata caching. To get table details such as source format and source URI, see Get table information.

You can also use this same procedure to upgrade Cloud Storage-based external tables to BigLake tables by associating the external table with a connection. For more information, see Upgrade external tables to BigLake tables.

To update a BigLake table, select one of the following options:

SQL

Use the CREATE OR REPLACE EXTERNAL TABLE DDL statement to update a table:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE OR REPLACE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
      WITH CONNECTION {`REGION.CONNECTION_ID` | DEFAULT}
      OPTIONS (
        format = "TABLE_FORMAT",
        uris = ['BUCKET_PATH'],
        max_staleness = STALENESS_INTERVAL,
        metadata_cache_mode = 'CACHE_MODE');

    Replace the following:

    • PROJECT_ID: the name of the project that contains the table
    • DATASET: the name of the dataset that contains the table
    • EXTERNAL_TABLE_NAME: the name of the table
    • REGION: the region that contains the connection
    • CONNECTION_ID: the name of the connection to use

      To use a default connection, specify DEFAULT instead of the connection string containing REGION.CONNECTION_ID.

    • TABLE_FORMAT: the format used by the table

      You can't change this when updating the table.

    • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format ['gs://bucket_name/[folder_name/]file_name'].

      You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the path. For example, ['gs://mybucket/file_name*']. For more information, see Wildcard support for Cloud Storage URIs.

      You can specify multiple buckets for the uris option by providing multiple paths.

      The following examples show valid uris values:

      • ['gs://bucket/path1/myfile.csv']
      • ['gs://bucket/path1/*.csv']
      • ['gs://bucket/path1/*', 'gs://bucket/path2/file00*']

      When you specify uris values that target multiple files, all of those files must share a compatible schema.

      For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it

      For more information about metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.

    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually

      For more information on metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.

bq

Use the bq mkdef and bq update commands to update a table:

  1. Generate an external table definition that describes the aspects of the table to change:

    bq mkdef --connection_id=PROJECT_ID.REGION.CONNECTION_ID \
      --source_format=TABLE_FORMAT \
      --metadata_cache_mode=CACHE_MODE \
      "BUCKET_PATH" > /tmp/DEFINITION_FILE

    Replace the following:

    • PROJECT_ID: the name of the project that contains the connection
    • REGION: the region that contains the connection
    • CONNECTION_ID: the name of the connection to use
    • TABLE_FORMAT: the format used by the table. You can't change this when updating the table.
    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information on metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

    • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format gs://bucket_name/[folder_name/]file_name.

      You can limit the files selected from the bucket by specifying one asterisk (*) wildcard character in the path. For example, gs://mybucket/file_name*. For more information, see Wildcard support for Cloud Storage URIs.

      You can specify multiple buckets for the uris option by providing multiple paths.

      The following examples show valid uris values:

      • gs://bucket/path1/myfile.csv
      • gs://bucket/path1/*.csv
      • gs://bucket/path1/*, gs://bucket/path2/file00*

      When you specify uris values that target multiple files, all of those files must share a compatible schema.

      For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

    • DEFINITION_FILE: the name of the table definition file that you are creating.

  2. Update the table using the new external table definition:

    bq update --max_staleness=STALENESS_INTERVAL \
      --external_table_definition=/tmp/DEFINITION_FILE \
      PROJECT_ID:DATASET.EXTERNAL_TABLE_NAME

    Replace the following:

    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval value between 30 minutes and 7 days, using the Y-M D H:M:S format described in the INTERVAL data type documentation. For example, specify 0-0 0 4:0:0 for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.

    • DEFINITION_FILE: the name of the table definition file that you created or updated.

    • PROJECT_ID: the name of the project that contains the table

    • DATASET: the name of the dataset that contains the table

    • EXTERNAL_TABLE_NAME: the name of the table

Example

The following example updates mytable to use cached metadata as long as it has been refreshed in the last 4.5 hours, and also to refresh cached metadata automatically:

bq update --project_id=myproject --max_staleness='0-0 0 4:30:0' \
--external_table_definition=enable_metadata.json mydataset.mytable

Where enable_metadata.json has the following contents:

{"metadataCacheMode": "AUTOMATIC"}

Audit logging

For information about logging in BigQuery, see Introduction to BigQuery monitoring. To learn more about logging in Google Cloud, see Cloud Logging.

What's next
