Create Amazon S3 BigLake external tables

Important: The term "BigLake" on this page refers to an access delegation functionality for external tables in BigQuery. For information about BigLake, the stand-alone Google Cloud product that includes BigLake metastore, the Apache Iceberg REST catalog, and BigLake tables for Apache Iceberg, see BigLake overview.

This document describes how to create an Amazon Simple Storage Service (Amazon S3) BigLake table. A BigLake table lets you use access delegation to query data in Amazon S3. Access delegation decouples access to the BigLake table from access to the underlying data store.

For information about how data flows between BigQuery and Amazon S3, see Data flow when querying data.

Before you begin

Ensure that you have a connection to access Amazon S3 data.

Required roles

To get the permissions that you need to create an external table, ask your administrator to grant you the BigQuery Admin (roles/bigquery.admin) IAM role on your dataset. For more information about granting roles, see Manage access to projects, folders, and organizations.

This predefined role contains the permissions required to create an external table. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to create an external table:

  • bigquery.tables.create
  • bigquery.connections.delegate

You might also be able to get these permissions with custom roles or other predefined roles.

Create a dataset

Before you create an external table, you need to create a dataset in a supported region. Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the left pane, click Explorer.
  3. In the Explorer pane, select the project where you want to create the dataset.
  4. Click View actions, and then click Create dataset.
  5. On the Create dataset page, specify the following details:
    1. For Dataset ID, enter a unique dataset name.
    2. For Data location, choose a supported region.
    3. Optional: To delete tables automatically, select the Enable table expiration checkbox and set the Default maximum table age in days. Data in Amazon S3 is not deleted when the table expires.
    4. If you want to use default collation, expand the Advanced options section and then select the Enable default collation option.
    5. Click Create dataset.

SQL

Use the CREATE SCHEMA DDL statement. The following example creates a dataset in the aws-us-east-1 region:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE SCHEMA mydataset
    OPTIONS (
      location = 'aws-us-east-1');

  3. Click Run.

For more information about how to run queries, see Run an interactive query.
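
The CREATE SCHEMA statement also accepts other dataset options. As a minimal sketch, the following statement sets both the location and a default table expiration of seven days, mirroring the Enable table expiration console option; the dataset name and the seven-day value are assumptions for illustration. As noted in the console steps, data in Amazon S3 is not deleted when a table expires.

CREATE SCHEMA mydataset
OPTIONS (
  location = 'aws-us-east-1',
  default_table_expiration_days = 7);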

bq

In a command-line environment, create a dataset using the bq mk command:

bq --location=LOCATION mk \
    --dataset \
    PROJECT_ID:DATASET_NAME

The --project_id parameter overrides the default project.

Replace the following:

  • LOCATION: the location of your dataset

    For information about supported regions, see Locations. After you create a dataset, you can't change its location. You can set a default value for the location by using the .bigqueryrc file.

  • PROJECT_ID: your project ID

  • DATASET_NAME: the name of the dataset that you want to create

    To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_NAME.

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetInfo;

// Sample to create a aws dataset
public class CreateDatasetAws {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String datasetName = "MY_DATASET_NAME";
    // Note: As of now location only supports aws-us-east-1
    String location = "aws-us-east-1";
    createDatasetAws(projectId, datasetName, location);
  }

  public static void createDatasetAws(String projectId, String datasetName, String location) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      DatasetInfo datasetInfo =
          DatasetInfo.newBuilder(projectId, datasetName).setLocation(location).build();

      Dataset dataset = bigquery.create(datasetInfo);
      System.out.println(
          "Aws dataset created successfully :" + dataset.getDatasetId().getDataset());
    } catch (BigQueryException e) {
      System.out.println("Aws dataset was not created. \n" + e.toString());
    }
  }
}

Create BigLake tables on unpartitioned data

Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the left pane, click Explorer:

    Highlighted button for the Explorer pane.

    If you don't see the left pane, click Expand left pane to open the pane.

  3. In the Explorer pane, expand your project, click Datasets, and then select a dataset.

  4. In the Dataset info section, click Create table.

  5. On the Create table page, in the Source section, do the following:

    1. For Create table from, select Amazon S3.
    2. For Select S3 path, enter a URI pointing to the Amazon S3 data in the format s3://BUCKET_NAME/PATH. Replace BUCKET_NAME with the name of the Amazon S3 bucket; the bucket's region should be the same as the dataset's region. Replace PATH with the path to the Amazon S3 data; it can contain one wildcard (*).
    3. For File format, select the data format in Amazon S3. Supported formats are AVRO, CSV, DELTA_LAKE, ICEBERG, JSONL, ORC, and PARQUET.
  6. In the Destination section, specify the following details:

    1. For Dataset, choose the appropriate dataset.
    2. In the Table field, enter the name of the table.
    3. Verify that Table type is set to External table.
    4. For Connection ID, choose the appropriate connection ID from the drop-down. For information about connections, see Connect to Amazon S3.
  7. In the Schema section, you can either enable schema auto-detection or manually specify a schema if you have a source file. If you don't have a source file, you must manually specify a schema.

    • To enable schema auto-detection, select the Auto-detect option.

    • To manually specify a schema, leave the Auto-detect option unchecked. Enable Edit as text and enter the table schema as a JSON array.

  8. Click Create table.

SQL

To create a BigLake table, use the CREATE EXTERNAL TABLE statement with the WITH CONNECTION clause:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL TABLE DATASET_NAME.TABLE_NAME
      WITH CONNECTION `AWS_LOCATION.CONNECTION_NAME`
      OPTIONS (
        format = "DATA_FORMAT",
        uris = ["S3_URI"],
        max_staleness = STALENESS_INTERVAL,
        metadata_cache_mode = 'CACHE_MODE');

    Replace the following:

    • DATASET_NAME: the name of the dataset you created
    • TABLE_NAME: the name you want to give to this table
    • AWS_LOCATION: an AWS location in Google Cloud (for example, `aws-us-east-1`)
    • CONNECTION_NAME: the name of the connection you created
    • DATA_FORMAT: any of the supported BigQuery federated formats (such as AVRO, CSV, DELTA_LAKE, ICEBERG, or PARQUET (preview))
    • S3_URI: a URI pointing to the Amazon S3 data (for example, s3://bucket/path)
    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Amazon S3 instead.

    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.

Example:

CREATE EXTERNAL TABLE awsdataset.awstable
  WITH CONNECTION `aws-us-east-1.s3-read-connection`
  OPTIONS (
    format = "CSV",
    uris = ["s3://s3-bucket/path/file.csv"],
    max_staleness = INTERVAL 1 DAY,
    metadata_cache_mode = 'AUTOMATIC');
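
If you set metadata_cache_mode to 'MANUAL', you refresh the cache yourself by calling the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure mentioned above. A minimal sketch, assuming the awsdataset.awstable table from the preceding example (use the fully qualified name if the table isn't in your default project):

CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('awsdataset.awstable');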

bq

Create a table definition file:

bq mkdef \
    --source_format=DATA_FORMAT \
    --connection_id=AWS_LOCATION.CONNECTION_NAME \
    --metadata_cache_mode=CACHE_MODE \
    S3_URI > table_def

Replace the following:

  • DATA_FORMAT: any of the supported BigQuery federated formats (such as AVRO, CSV, DELTA_LAKE, ICEBERG, or PARQUET).
  • S3_URI: a URI pointing to the Amazon S3 data (for example, s3://bucket/path).
  • AWS_LOCATION: an AWS location in Google Cloud (for example, aws-us-east-1).
  • CONNECTION_NAME: the name of the connection you created.

  • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. You only need to include this flag if you also plan to use the --max_staleness flag in the subsequent bq mk command to enable metadata caching. For more information about metadata caching considerations, see Metadata caching for performance.

    Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

    Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache. You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

Note: To override the default project, use the --project_id=PROJECT_ID parameter. Replace PROJECT_ID with the ID of your Google Cloud project.

Next, create the BigLake table:

bq mk --max_staleness=STALENESS_INTERVAL \
    --external_table_definition=table_def \
    DATASET_NAME.TABLE_NAME

Replace the following:

  • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. To disable metadata caching, specify 0 (the default). To enable metadata caching, specify an interval literal value between 30 minutes and 7 days, for example, INTERVAL "4" HOUR.
  • DATASET_NAME: the name of the dataset you created.
  • TABLE_NAME: the name you want to give to this table.

For example, the following command creates a new BigLake table, awsdataset.awstable, which can query your Amazon S3 data that's stored at the path s3://s3-bucket/path/file.csv and has a read connection in the location aws-us-east-1:

bq mkdef \
    --autodetect \
    --source_format=CSV \
    --connection_id=aws-us-east-1.s3-read-connection \
    --metadata_cache_mode=AUTOMATIC \
    s3://s3-bucket/path/file.csv > table_def

bq mk --max_staleness=INTERVAL "1" HOUR \
    --external_table_definition=table_def \
    awsdataset.awstable

API

Call the tables.insert API method, and create an ExternalDataConfiguration in the Table resource that you pass in.

Specify the schema property or set the autodetect property to true to enable schema auto-detection for supported data sources.

Specify the connectionId property to identify the connection to use for connecting to Amazon S3.

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

// Sample to create an external aws table
public class CreateExternalTableAws {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String connectionId = "MY_CONNECTION_ID";
    String sourceUri = "s3://your-bucket-name/";
    CsvOptions options = CsvOptions.newBuilder().setSkipLeadingRows(1).build();
    Schema schema =
        Schema.of(
            Field.of("name", StandardSQLTypeName.STRING),
            Field.of("post_abbr", StandardSQLTypeName.STRING));
    ExternalTableDefinition externalTableDefinition =
        ExternalTableDefinition.newBuilder(sourceUri, options)
            .setConnectionId(connectionId)
            .setSchema(schema)
            .build();
    createExternalTableAws(projectId, datasetName, tableName, externalTableDefinition);
  }

  public static void createExternalTableAws(
      String projectId,
      String datasetName,
      String tableName,
      ExternalTableDefinition externalTableDefinition) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
      TableId tableId = TableId.of(projectId, datasetName, tableName);
      TableInfo tableInfo = TableInfo.newBuilder(tableId, externalTableDefinition).build();
      bigquery.create(tableInfo);
      System.out.println("Aws external table created successfully");

      // Clean up
      bigquery.delete(TableId.of(projectId, datasetName, tableName));
    } catch (BigQueryException e) {
      System.out.println("Aws external was not created." + e.toString());
    }
  }
}

Create BigLake tables on partitioned data

You can create a BigLake table for Hive partitioned data in Amazon S3. After you create an externally partitioned table, you can't change the partition key. You need to recreate the table to change the partition key.

To create a BigLake table based on Hive partitioned data, select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the left pane, click Explorer:

    Highlighted button for the Explorer pane.

  3. In the Explorer pane, expand your project, click Datasets, and select a dataset.

  4. Click Create table. This opens the Create table pane.

  5. In the Source section, specify the following details:

    1. For Create table from, select Amazon S3.

    2. Provide the path to the folder, using wildcards. For example, s3://mybucket/*.

      The folder must be in the same location as the dataset that contains the table you want to create, append, or overwrite.

    3. From the File format list, select the file type.

    4. Select the Source data partitioning checkbox, and then specify the following details:

      1. For Select Source URI Prefix, enter the URI prefix. For example, s3://mybucket/my_files.
      2. Optional: To require a partition filter on all queries for this table, select the Require partition filter checkbox. Requiring a partition filter can reduce cost and improve performance. For more information, see Requiring predicate filters on partition keys in queries.
      3. In the Partition inference mode section, select one of the following options:

        • Automatically infer types: set the partition schema detection mode to AUTO.
        • All columns are strings: set the partition schema detection mode to STRINGS.
        • Provide my own: set the partition schema detection mode to CUSTOM and manually enter the schema information for the partition keys. For more information, see Custom partition key schema.
  6. In the Destination section, specify the following details:

    1. For Project, select the project in which you want to create the table.
    2. For Dataset, select the dataset in which you want to create the table.
    3. For Table, enter the name of the table that you want to create.
    4. For Table type, verify that External table is selected.
    5. For Connection ID, select the connection that you created earlier.
  7. In the Schema section, you can either enable schema auto-detection or manually specify a schema if you have a source file. If you don't have a source file, you must manually specify a schema.

    • To enable schema auto-detection, select the Auto-detect option.

    • To manually specify a schema, leave the Auto-detect option unchecked. Enable Edit as text and enter the table schema as a JSON array.

  8. To ignore rows with extra column values that don't match the schema, expand the Advanced options section and select Unknown values.

  9. Click Create table.

SQL

Use the CREATE EXTERNAL TABLE DDL statement:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
      WITH PARTITION COLUMNS
      (
        PARTITION_COLUMN PARTITION_COLUMN_TYPE,
      )
      WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
      OPTIONS (
        hive_partition_uri_prefix = "HIVE_PARTITION_URI_PREFIX",
        uris = ['FILE_PATH'],
        format = "TABLE_FORMAT",
        max_staleness = STALENESS_INTERVAL,
        metadata_cache_mode = 'CACHE_MODE');

    Replace the following:

    • PROJECT_ID: the name of your project in which you want to create the table—for example, myproject
    • DATASET: the name of the BigQuery dataset that you want to create the table in—for example, mydataset
    • EXTERNAL_TABLE_NAME: the name of the table that you want to create—for example, mytable
    • PARTITION_COLUMN: the name of the partitioning column
    • PARTITION_COLUMN_TYPE: the type of the partitioning column
    • REGION: the region that contains the connection—for example, us
    • CONNECTION_ID: the name of the connection—for example, myconnection
    • HIVE_PARTITION_URI_PREFIX: the Hive partitioning URI prefix—for example, s3://mybucket/
    • FILE_PATH: path to the data source for the external table that you want to create—for example, s3://mybucket/*.parquet
    • TABLE_FORMAT: the format of the table that you want to create—for example, PARQUET
    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Amazon S3 instead.

    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.

Examples

The following example creates a BigLake table over partitioned data in Amazon S3. The schema is autodetected.

CREATE EXTERNAL TABLE `my_dataset.my_table`
  WITH PARTITION COLUMNS
  (
    sku STRING,
  )
  WITH CONNECTION `us.my-connection`
  OPTIONS (
    hive_partition_uri_prefix = "s3://mybucket/products",
    uris = ['s3://mybucket/products/*'],
    max_staleness = INTERVAL 1 DAY,
    metadata_cache_mode = 'AUTOMATIC');
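
If you require a partition filter (the Require partition filter checkbox in the console, or the --require_hive_partition_filter flag shown in the bq tab that follows), every query against the table must include a predicate on the partition column. A minimal sketch against the table above, using a hypothetical sku value:

SELECT *
FROM `my_dataset.my_table`
WHERE sku = 'example-sku'
LIMIT 10;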

bq

First, use the bq mkdef command to create a table definition file:

bq mkdef \
    --source_format=SOURCE_FORMAT \
    --connection_id=REGION.CONNECTION_ID \
    --hive_partitioning_mode=PARTITIONING_MODE \
    --hive_partitioning_source_uri_prefix=URI_SHARED_PREFIX \
    --require_hive_partition_filter=BOOLEAN \
    --metadata_cache_mode=CACHE_MODE \
    URIS > DEFINITION_FILE

Replace the following:

  • SOURCE_FORMAT: the format of the external data source. For example, CSV.
  • REGION: the region that contains the connection—for example, us.
  • CONNECTION_ID: the name of the connection—for example, myconnection.
  • PARTITIONING_MODE: the Hive partitioning mode. Use one of the following values:
    • AUTO: Automatically detect the key names and types.
    • STRINGS: Automatically convert the key names to strings.
    • CUSTOM: Encode the key schema in the source URI prefix.
  • URI_SHARED_PREFIX: the source URI prefix.
  • BOOLEAN: specifies whether to require a predicate filter at query time. This flag is optional. The default value is false.

  • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. You only need to include this flag if you also plan to use the --max_staleness flag in the subsequent bq mk command to enable metadata caching. For more information about metadata caching considerations, see Metadata caching for performance.

    Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

    Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache. You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

  • URIS: the path to the Amazon S3 folder, using wildcard format.

  • DEFINITION_FILE: the path to the table definition file on your local machine.

If PARTITIONING_MODE is CUSTOM, include the partition key schema in the source URI prefix, using the following format:

--hive_partitioning_source_uri_prefix=URI_SHARED_PREFIX/{KEY1:TYPE1}/{KEY2:TYPE2}/...

After you create the table definition file, use the bq mk command to create the BigLake table:

bq mk --max_staleness=STALENESS_INTERVAL \
    --external_table_definition=DEFINITION_FILE \
    DATASET_NAME.TABLE_NAME \
    SCHEMA

Replace the following:

  • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

    To disable metadata caching, specify 0. This is the default.

    To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Amazon S3 instead.

  • DEFINITION_FILE: the path to the table definition file.

  • DATASET_NAME: the name of the dataset that contains the table.

  • TABLE_NAME: the name of the table you're creating.

  • SCHEMA: specifies a path to a JSON schema file, or specifies the schema in the form field:data_type,field:data_type,.... To use schema auto-detection, omit this argument.

Examples

The following example uses AUTO Hive partitioning mode for Amazon S3 data:

bq mkdef --source_format=CSV \
  --connection_id=us.my-connection \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=s3://mybucket/myTable \
  --metadata_cache_mode=AUTOMATIC \
  s3://mybucket/* > mytable_def

bq mk --max_staleness=INTERVAL "1" HOUR \
  --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

The following example uses STRING Hive partitioning mode for Amazon S3 data:

bq mkdef --source_format=CSV \
  --connection_id=us.my-connection \
  --hive_partitioning_mode=STRING \
  --hive_partitioning_source_uri_prefix=s3://mybucket/myTable \
  --metadata_cache_mode=AUTOMATIC \
  s3://mybucket/myTable/* > mytable_def

bq mk --max_staleness=INTERVAL "1" HOUR \
  --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

API

To set Hive partitioning using the BigQuery API, include the hivePartitioningOptions object in the ExternalDataConfiguration object when you create the table definition file. To create a BigLake table, you must also specify a value for the connectionId field.

If you set the hivePartitioningOptions.mode field to CUSTOM, you must encode the partition key schema in the hivePartitioningOptions.sourceUriPrefix field as follows: s3://BUCKET/PATH_TO_TABLE/{KEY1:TYPE1}/{KEY2:TYPE2}/...

To enforce the use of a predicate filter at query time, set the hivePartitioningOptions.requirePartitionFilter field to true.

Delta Lake tables

Delta Lake is an open source table format that supports petabyte-scale data tables. Delta Lake tables can be queried as both temporary and permanent tables, and are supported as a BigLake table.

Schema synchronization

Delta Lake maintains a canonical schema as part of its metadata. You can't update a schema using a JSON metadata file. To update the schema:

  1. Use the bq update command with the --autodetect_schema flag:

    bq update --autodetect_schema PROJECT_ID:DATASET.TABLE

    Replace the following:

    • PROJECT_ID: the project ID containing the table that you want to update

    • DATASET: the dataset containing the table that you want to update

    • TABLE: the table that you want to update

Type conversion

BigQuery converts Delta Lake data types to the following BigQuery data types:

Delta Lake Type                      BigQuery Type
boolean                              BOOL
byte                                 INT64
int                                  INT64
long                                 INT64
float                                FLOAT64
double                               FLOAT64
Decimal(P/S)                         NUMERIC or BIG_NUMERIC depending on precision
date                                 DATE
time                                 TIME
timestamp (not partition column)     TIMESTAMP
timestamp (partition column)         DATETIME
string                               STRING
binary                               BYTES
array<Type>                          ARRAY<Type>
struct                               STRUCT
map<KeyType, ValueType>              ARRAY<Struct<key KeyType, value ValueType>>
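
Because map<KeyType, ValueType> surfaces in BigQuery as ARRAY<STRUCT<key, value>>, you can flatten it with UNNEST. A minimal sketch, assuming a hypothetical Delta Lake table my_dataset.my_delta_table with an id column and a map column named attributes:

SELECT
  t.id,
  kv.key,
  kv.value
FROM `my_dataset.my_delta_table` AS t,
  UNNEST(t.attributes) AS kv;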

Limitations

The following limitations apply to Delta Lake tables:

  • External table limitations apply to Delta Lake tables.

  • Delta Lake tables are only supported on BigQuery Omni and have the associated limitations.

  • You can't update a table with a new JSON metadata file. You must use an autodetect schema table update operation. See Schema synchronization for more information.

  • BigLake security features only protect Delta Lake tables when accessed through BigQuery services.

Create a Delta Lake table

The following example creates an external table by using the CREATE EXTERNAL TABLE statement with the Delta Lake format:

CREATE [OR REPLACE] EXTERNAL TABLE table_name
  WITH CONNECTION connection_name
  OPTIONS (
    format = 'DELTA_LAKE',
    uris = ["parent_directory"]
  );

Replace the following:

  • table_name: The name of the table.

  • connection_name: The name of the connection. The connection must identify either an Amazon S3 or a Blob Storage source.

  • parent_directory: The URI of the parent directory.
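
For example, a minimal sketch with the placeholders filled in using hypothetical names (a dataset named mydataset, an Amazon S3 connection named aws-us-east-1.my-connection, and a Delta Lake table rooted at s3://mybucket/delta/my_table):

CREATE EXTERNAL TABLE mydataset.delta_table
  WITH CONNECTION `aws-us-east-1.my-connection`
  OPTIONS (
    format = 'DELTA_LAKE',
    uris = ["s3://mybucket/delta/my_table"]
  );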

Cross-cloud transfer with Delta Lake

The following example uses the LOAD DATA statement to load data to the appropriate table:

LOAD DATA [INTO | OVERWRITE] table_name
FROM FILES (
  format = 'DELTA_LAKE',
  uris = ["parent_directory"]
)
WITH CONNECTION connection_name;
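
For example, a minimal sketch that loads the same hypothetical Delta Lake table at s3://mybucket/delta/my_table into a standard BigQuery table named mydataset.delta_copy, through the same connection:

LOAD DATA INTO mydataset.delta_copy
FROM FILES (
  format = 'DELTA_LAKE',
  uris = ["s3://mybucket/delta/my_table"]
)
WITH CONNECTION `aws-us-east-1.my-connection`;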

For more examples of cross-cloud data transfers, see Load data with cross-cloud operations.

Query BigLake tables

For more information, see Query Amazon S3 data.
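
You query a BigLake table with GoogleSQL, the same way you query any other BigQuery table. A minimal sketch, assuming the awsdataset.awstable example created earlier in this document:

SELECT * FROM awsdataset.awstable LIMIT 10;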

View resource metadata

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

You can view the resource metadata with INFORMATION_SCHEMA views. When you query the JOBS_BY_*, JOBS_TIMELINE_BY_*, and RESERVATION* views, you must specify the query's processing location that is collocated with the table's region. For information about BigQuery Omni locations, see Locations. For all other system tables, specifying the query job location is optional.

For information about the system tables that BigQuery Omni supports, see Limitations.

To query JOBS_* and RESERVATION* system tables, select one of the following methods to specify the processing location:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. If the Editor tab isn't visible, then click Compose new query.

  3. Click More > Query settings. The Query settings dialog opens.

  4. In the Query settings dialog, for Additional settings > Data location, select the BigQuery region that is collocated with the BigQuery Omni region. For example, if your BigQuery Omni region is aws-us-east-1, specify us-east4.

  5. Select the remaining fields and click Save.

bq

Use the --location flag to set the job's processing location to the BigQuery region that is collocated with the BigQuery Omni region. For example, if your BigQuery Omni region is aws-us-east-1, specify us-east4.

Example

bq query --use_legacy_sql=false --location=us-east4 \
    "SELECT * FROM region-aws-us-east-1.INFORMATION_SCHEMA.JOBS limit 10;"

bq query --use_legacy_sql=false --location=asia-northeast3 \
    "SELECT * FROM region-aws-ap-northeast-2.INFORMATION_SCHEMA.JOBS limit 10;"

API

If you are running jobs programmatically, set the location argument to the BigQuery region that is collocated with the BigQuery Omni region. For example, if your BigQuery Omni region is aws-us-east-1, specify us-east4.

The following example lists the metadata refresh jobs in the last six hours:

SELECT
  *
FROM
  `region-REGION_NAME`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_id LIKE '%metadata_cache_refresh%'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
ORDER BY start_time DESC
LIMIT 10;

Replace REGION_NAME with your region.

VPC Service Controls

You can use VPC Service Controls perimeters to restrict access from BigQuery Omni to an external cloud service as an extra layer of defense. For example, VPC Service Controls perimeters can limit exports from your BigQuery Omni tables to a specific Amazon S3 bucket or Blob Storage container.

To learn more about VPC Service Controls, see Overview of VPC Service Controls.

Required permission

Ensure that you have the required permissions to configure service perimeters. To view a list of IAM roles required to configure VPC Service Controls, see Access control with IAM in the VPC Service Controls documentation.

Set up VPC Service Controls using the Google Cloud console

  1. In the Google Cloud console navigation menu, click Security, and then click VPC Service Controls.

    Go to VPC Service Controls

  2. To set up VPC Service Controls for BigQuery Omni, follow the steps in the Create a service perimeter guide, and when you are in the Egress rules pane, follow these steps:

    1. In the Egress rules panel, click Add rule.

    2. In the From attributes of the API client section, select an option from the Identity list.

    3. Select To attributes of external resources.

    4. To add an external resource, click Add external resources.

    5. In the Add external resource dialog, for External resource name, enter a valid resource name. For example:

      • For Amazon Simple Storage Service (Amazon S3): s3://BUCKET_NAME

        Replace BUCKET_NAME with the name of your Amazon S3 bucket.

      • For Azure Blob Storage: azure://myaccount.blob.core.windows.net/CONTAINER_NAME

        Replace CONTAINER_NAME with the name of your Blob Storage container.

      For a list of egress rule attributes, see Egress rules reference.

    6. Select the methods that you want to allow on your external resources:

      1. If you want to allow all methods, select All methods in the Methods list.
      2. If you want to allow specific methods, select Selected method, click Select methods, and then select the methods that you want to allow on your external resources.
    7. Click Create perimeter.

Set up VPC Service Controls using the gcloud CLI

To set up VPC Service Controls using the gcloud CLI, follow these steps:

  1. Set the default access policy.
  2. Create the egress policy input file.
  3. Add the egress policy.

Set the default access policy

An access policy is an organization-wide container for access levels and service perimeters. For information about setting a default access policy or getting an access policy name, see Managing an access policy.

Create the egress policy input file

An egress rule block defines the allowed access from within a perimeter to resources outside of that perimeter. For external resources, the externalResources property defines the external resource paths allowed access from within your VPC Service Controls perimeter.

Egress rules can be configured using a JSON file, or a YAML file. The following sample uses the .yaml format:

- egressTo:
    operations:
    - serviceName: bigquery.googleapis.com
      methodSelectors:
      - method: "*"
        *OR*
      - permission: "externalResource.read"
    externalResources:
    - EXTERNAL_RESOURCE_PATH
  egressFrom:
    identityType: IDENTITY_TYPE
    *OR*
    identities:
    - serviceAccount: SERVICE_ACCOUNT

  • egressTo: lists allowed service operations on Google Cloud resources in specified projects outside the perimeter.

  • operations: list accessible services and actions or methods that a client satisfying the from block conditions is allowed to access.

  • serviceName: set bigquery.googleapis.com for BigQuery Omni.

  • methodSelectors: list methods that a client satisfying the from conditions can access. For restrictable methods and permissions for services, see Supported service method restrictions.

  • method: a valid service method, or \"*\" to allow all serviceName methods.

  • permission: a valid service permission, such as \"*\", externalResource.read, or externalResource.write. Access to resources outside the perimeter is allowed for operations that require this permission.

  • externalResources: lists external resources that clients inside a perimeter can access. Replace EXTERNAL_RESOURCE_PATH with either a valid Amazon S3 bucket, such as s3://bucket_name, or a Blob Storage container path, such as azure://myaccount.blob.core.windows.net/container_name.

  • egressFrom: lists allowed service operations on Google Cloud resources in specified projects within the perimeter.

  • identityType or identities: defines the identity types that can access the specified resources outside the perimeter. Replace IDENTITY_TYPE with one of the following valid values:

    • ANY_IDENTITY: to allow all identities.
    • ANY_USER_ACCOUNT: to allow all users.
    • ANY_SERVICE_ACCOUNT: to allow all service accounts.
  • identities: lists service accounts that can access the specified resources outside the perimeter.

  • serviceAccount (optional): replace SERVICE_ACCOUNT with the service account that can access the specified resources outside the perimeter.

Examples

The following example is a policy that allows egress operations from inside the perimeter to the s3://mybucket and s3://mybucket2 Amazon S3 locations in AWS.

- egressTo:
    operations:
    - serviceName: bigquery.googleapis.com
      methodSelectors:
      - method: "*"
    externalResources:
    - s3://mybucket
    - s3://mybucket2
  egressFrom:
    identityType: ANY_IDENTITY

The following example allows egress operations to a Blob Storage container:

- egressTo:
    operations:
    - serviceName: bigquery.googleapis.com
      methodSelectors:
      - method: "*"
    externalResources:
    - azure://myaccount.blob.core.windows.net/mycontainer
  egressFrom:
    identityType: ANY_IDENTITY

For more information about egress policies, see the Egress rules reference.

Add the egress policy

To add the egress policy when you create a new service perimeter, use the gcloud access-context-manager perimeters create command. For example, the following command creates a new perimeter named omniPerimeter that includes the project with project number 12345, restricts the BigQuery API, and adds an egress policy defined in the egress.yaml file:

gcloud access-context-manager perimeters create omniPerimeter \
    --title="Omni Perimeter" \
    --resources=projects/12345 \
    --restricted-services=bigquery.googleapis.com \
    --egress-policies=egress.yaml

To add the egress policy to an existing service perimeter, use the gcloud access-context-manager perimeters update command. For example, the following command adds an egress policy defined in the egress.yaml file to an existing service perimeter named omniPerimeter:

gcloud access-context-manager perimeters update omniPerimeter \
    --set-egress-policies=egress.yaml

Verify your perimeter

To verify the perimeter, use the gcloud access-context-manager perimeters describe command:

gcloud access-context-manager perimeters describe PERIMETER_NAME

Replace PERIMETER_NAME with the name of the perimeter.

For example, the following command describes the perimeter omniPerimeter:

gcloud access-context-manager perimeters describe omniPerimeter

For more information, see Managing service perimeters.

Allow BigQuery Omni VPC access to Amazon S3

As a BigQuery administrator, you can create an S3 bucket policy to grant BigQuery Omni access to your Amazon S3 resources. This ensures that only authorized BigQuery Omni VPCs can interact with your Amazon S3 buckets, enhancing the security of your data.

Apply an S3 bucket policy for BigQuery Omni VPC

To apply an S3 bucket policy, use the AWS CLI or Terraform:

AWS CLI

Run the following command to apply an S3 bucket policy that includes a condition using the aws:SourceVpc attribute:

aws s3api put-bucket-policy \
  --bucket=BUCKET_NAME \
  --policy "{
      \"Version\": \"2012-10-17\",
      \"Id\": \"RestrictBucketReads\",
      \"Statement\": [
          {
              \"Sid\": \"AccessOnlyToOmniVPC\",
              \"Principal\": \"*\",
              \"Action\": [\"s3:ListBucket\", \"s3:GetObject\"],
              \"Effect\": \"Allow\",
              \"Resource\": [\"arn:aws:s3:::BUCKET_NAME\",
                             \"arn:aws:s3:::BUCKET_NAME/*\"],
              \"Condition\": {
                  \"StringEquals\": {
                    \"aws:SourceVpc\": \"VPC_ID\"
                  }
              }
          }
      ]
    }"

Replace the following:

  • BUCKET_NAME: the Amazon S3 bucket that you want BigQuery to access.
  • VPC_ID: the BigQuery Omni VPC ID of the BigQuery Omni region collocated with the Amazon S3 bucket. You can find this information in the table on this page.

Terraform

Add the following to your Terraform configuration file:

resource"aws_s3_bucket""example"{bucket="BUCKET_NAME"}resource"aws_s3_bucket_policy""example"{bucket=aws_s3_bucket.example.idpolicy=jsonencode({Version="2012-10-17"Id="RestrictBucketReads"Statement=[{Sid="AccessOnlyToOmniVPC"Effect="Allow"Principal="*"Action=["s3:GetObject", "s3:ListBucket"]Resource=[aws_s3_bucket.example.arn,"${aws_s3_bucket.example.arn}/*"]Condition={StringEquals={"aws:SourceVpc":"VPC_ID"}}},]})}

Replace the following:

  • BUCKET_NAME: the Amazon S3 bucket that you want BigQuery to access.
  • VPC_ID: the BigQuery Omni VPC ID of the BigQuery Omni region collocated with the Amazon S3 bucket.

BigQuery Omni VPC Resource IDs

Region               VPC ID
aws-ap-northeast-2   vpc-0b488548024288af2
aws-ap-southeast-2   vpc-0726e08afef3667ca
aws-eu-central-1     vpc-05c7bba12ad45558f
aws-eu-west-1        vpc-0e5c646979bbe73a0
aws-us-east-1        vpc-0bf63a2e71287dace
aws-us-west-2        vpc-0cc24e567b9d2c1cb

Limitations

For a full list of limitations that apply to BigLake tables based on Amazon S3 and Blob Storage, see Limitations.

What's next
