Create AWS Glue federated datasets
This document describes how to create a federated dataset in BigQuery that's linked to an existing database in AWS Glue.
A federated dataset is a connection between BigQuery and an external data source at the dataset level. The tables in a federated dataset are automatically populated from the tables in the corresponding external data source. You can query these tables directly in BigQuery, but you cannot make modifications, additions, or deletions. However, any updates that you make in the external data source are automatically reflected in BigQuery.
Before you begin
Ensure that you have a connection to access AWS Glue data.
To create or modify a connection, follow the instructions in Connect to Amazon S3. When you create that connection, include the following policy statement for AWS Glue in your AWS Identity and Access Management policy for BigQuery. Include this statement in addition to the other permissions on the Amazon S3 bucket where the data in your AWS Glue tables is stored.

```json
{
  "Effect": "Allow",
  "Action": [
    "glue:GetDatabase",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartitions"
  ],
  "Resource": [
    "arn:aws:glue:REGION:ACCOUNT_ID:catalog",
    "arn:aws:glue:REGION:ACCOUNT_ID:database/DATABASE_NAME",
    "arn:aws:glue:REGION:ACCOUNT_ID:table/DATABASE_NAME/*"
  ]
}
```

Replace the following:

- REGION: the AWS region—for example, us-east-1
- ACCOUNT_ID: the 12-digit AWS account ID
- DATABASE_NAME: the AWS Glue database name
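If you script your AWS setup, the three Resource ARNs follow one pattern. The following Python sketch builds them from the same three values; the helper name is ours for illustration and is not part of any AWS or Google Cloud SDK:

```python
def glue_policy_resources(region, account_id, database):
    """Build the Resource ARNs for the AWS Glue policy statement above.

    Illustrative helper only -- not part of any AWS or Google Cloud SDK.
    """
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return [
        f"{prefix}:catalog",              # the Glue Data Catalog itself
        f"{prefix}:database/{database}",  # the specific database
        f"{prefix}:table/{database}/*",   # all tables in that database
    ]

print(glue_policy_resources("us-east-1", "123456789012", "test_database"))
```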
Required permissions
To get the permissions that you need to create a federated dataset, ask your administrator to grant you the BigQuery Admin (roles/bigquery.admin) IAM role. For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to create a federated dataset. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create a federated dataset:
- bigquery.datasets.create
- bigquery.connections.use
- bigquery.connections.delegate

You might also be able to get these permissions with custom roles or other predefined roles.
For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.
Create a federated dataset
To create a federated dataset, do the following:
Console
Open the BigQuery page in the Google Cloud console.
In the left pane, click Explorer.
If you don't see the left pane, click Expand left pane to open the pane.
In the Explorer pane, select the project where you want to create the dataset.
Click View actions, and then click Create dataset.
On the Create dataset page, do the following:
- For Dataset ID, enter a unique dataset name.
- For Location type, choose an AWS location for the dataset, such as aws-us-east-1. After you create a dataset, the location can't be changed.
- For External Dataset, do the following:
  - Check the box next to Link to an external dataset.
  - For External dataset type, select AWS Glue.
  - For External source, enter aws-glue:// followed by the Amazon Resource Name (ARN) of the AWS Glue database—for example, aws-glue://arn:aws:glue:us-east-1:123456789:database/test_database.
  - For Connection ID, select your AWS connection.
- Leave the other default settings as they are.

Click Create dataset.
SQL
Use the CREATE EXTERNAL SCHEMA data definition language (DDL) statement.
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:

```sql
CREATE EXTERNAL SCHEMA DATASET_NAME
WITH CONNECTION `PROJECT_ID.CONNECTION_LOCATION.CONNECTION_NAME`
OPTIONS (
  external_source = 'AWS_GLUE_SOURCE',
  location = 'LOCATION');
```

Replace the following:

- DATASET_NAME: the name of your new dataset in BigQuery.
- PROJECT_ID: your project ID.
- CONNECTION_LOCATION: the location of your AWS connection—for example, aws-us-east-1.
- CONNECTION_NAME: the name of your AWS connection.
- AWS_GLUE_SOURCE: the Amazon Resource Name (ARN) of the AWS Glue database with a prefix identifying the source—for example, aws-glue://arn:aws:glue:us-east-1:123456789:database/test_database.
- LOCATION: the location of your new dataset in BigQuery—for example, aws-us-east-1. After you create a dataset, you can't change its location.

Click Run.
For more information about how to run queries, see Run an interactive query.
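If you generate this statement from code, the pieces assemble mechanically. The following Python sketch builds the DDL string from the same parameters; the helper is illustrative only (not a BigQuery API), and the backticks around the connection path allow hyphens in the project ID:

```python
def create_external_schema_ddl(dataset_name, project_id, connection_location,
                               connection_name, aws_glue_source, location):
    """Assemble the CREATE EXTERNAL SCHEMA statement described above.

    Illustrative helper only -- not part of any Google Cloud SDK.
    """
    # Backticks let the connection path contain hyphens (common in project IDs).
    connection = f"{project_id}.{connection_location}.{connection_name}"
    return (
        f"CREATE EXTERNAL SCHEMA {dataset_name}\n"
        f"WITH CONNECTION `{connection}`\n"
        "OPTIONS (\n"
        f"  external_source = '{aws_glue_source}',\n"
        f"  location = '{location}');"
    )

ddl = create_external_schema_ddl(
    "glue_dataset", "my-project", "aws-us-east-1", "my-connection",
    "aws-glue://arn:aws:glue:us-east-1:123456789:database/test_database",
    "aws-us-east-1")
print(ddl)
```

The resulting string can then be submitted like any other query, for example from the query editor or a client library.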
bq
In a command-line environment, create a dataset by using the bq mk command:

```shell
bq --location=LOCATION mk --dataset \
    --external_source aws-glue://AWS_GLUE_SOURCE \
    --connection_id PROJECT_ID.CONNECTION_LOCATION.CONNECTION_NAME \
    DATASET_NAME
```

Replace the following:

- LOCATION: the location of your new dataset in BigQuery—for example, aws-us-east-1. After you create a dataset, you can't change its location. You can set a default location value by using the .bigqueryrc file.
- AWS_GLUE_SOURCE: the Amazon Resource Name (ARN) of the AWS Glue database—for example, arn:aws:glue:us-east-1:123456789:database/test_database.
- PROJECT_ID: your BigQuery project ID.
- CONNECTION_LOCATION: the location of your AWS connection—for example, aws-us-east-1.
- CONNECTION_NAME: the name of your AWS connection.
- DATASET_NAME: the name of your new dataset in BigQuery. To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_NAME.
Terraform
Use the google_bigquery_dataset resource.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
The following example creates an AWS Glue federated dataset:

```hcl
resource "google_bigquery_dataset" "dataset" {
  provider      = google-beta
  dataset_id    = "example_dataset"
  friendly_name = "test"
  description   = "This is a test description."
  location      = "aws-us-east-1"

  external_dataset_reference {
    external_source = "aws-glue://arn:aws:glue:us-east-1:999999999999:database/database"
    connection      = "projects/project/locations/aws-us-east-1/connections/connection"
  }
}
```
To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.
Prepare Cloud Shell
- Launch Cloud Shell.
Set the default Google Cloud project where you want to apply your Terraform configurations.
You only need to run this command once per project, and you can run it in any directory.
```shell
export GOOGLE_CLOUD_PROJECT=PROJECT_ID
```
Environment variables are overridden if you set explicit values in the Terraform configuration file.
Prepare the directory
Each Terraform configuration file must have its own directory (also called a root module).
- In Cloud Shell, create a directory and a new file within that directory. The filename must have the .tf extension—for example, main.tf. In this tutorial, the file is referred to as main.tf.

  ```shell
  mkdir DIRECTORY && cd DIRECTORY && touch main.tf
  ```
If you are following a tutorial, you can copy the sample code in each section or step.
- Copy the sample code into the newly created main.tf. Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution.
- Review and modify the sample parameters to apply to your environment.
- Save your changes.
- Initialize Terraform. You only need to do this once per directory.

  ```shell
  terraform init
  ```

  Optionally, to use the latest Google provider version, include the -upgrade option:

  ```shell
  terraform init -upgrade
  ```
Apply the changes
- Review the configuration and verify that the resources that Terraform is going to create or update match your expectations:

  ```shell
  terraform plan
  ```

Make corrections to the configuration as necessary.
- Apply the Terraform configuration by running the following command and entering yes at the prompt:

  ```shell
  terraform apply
  ```
Wait until Terraform displays the "Apply complete!" message.
- Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.
API
Call the datasets.insert method with a defined dataset resource and externalDatasetReference field for your AWS Glue database.
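As a sketch of what that request carries, the following Python snippet builds a plausible datasets.insert request body. The top-level field names follow the BigQuery REST dataset resource (datasetReference, location, externalDatasetReference); the project, connection, and ARN values are placeholders, and the HTTP call itself is not shown:

```python
import json


def dataset_insert_body(dataset_id, location, glue_arn, connection_path):
    """Build a datasets.insert request body for an AWS Glue federated dataset.

    Sketch only: the values are placeholders, and sending the request
    (via REST or a client library) is left out.
    """
    return {
        "datasetReference": {"datasetId": dataset_id},
        "location": location,
        "externalDatasetReference": {
            # The ARN is prefixed with aws-glue:// to identify the source type.
            "externalSource": f"aws-glue://{glue_arn}",
            "connection": connection_path,
        },
    }


body = dataset_insert_body(
    "glue_dataset", "aws-us-east-1",
    "arn:aws:glue:us-east-1:123456789:database/test_database",
    "projects/my-project/locations/aws-us-east-1/connections/my-connection")
print(json.dumps(body, indent=2))
```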
List tables in a federated dataset
To list the tables that are available for query in your federated dataset, see Listing datasets.
Get table information
To get information on the tables in your federated dataset, such as schema details, see Get table information.
Control access to tables
To manage access to the tables in your federated dataset, see Control access to resources with IAM.
Row-level security, column-level security, and data masking are also supported for tables in federated datasets.
Schema operations that might invalidate security policies, such as deleting a column in AWS Glue, can cause jobs to fail until the policies are updated. Additionally, if you delete a table in AWS Glue and recreate it, your security policies no longer apply to the recreated table.
Query AWS Glue data
Querying tables in federated datasets is the same as querying tables in any other BigQuery dataset.
You can query AWS Glue tables in the following formats:
- CSV (compressed and uncompressed)
- JSON (compressed and uncompressed)
- Parquet
- ORC
- Avro
- Iceberg
- Delta Lake
Table mapping details
Every table that you grant access to in your AWS Glue database appears as an equivalent table in your BigQuery dataset.
Format
The format of each BigQuery table is determined by the following fields of the respective AWS Glue table:

- InputFormat (Table.StorageDescriptor.InputFormat)
- OutputFormat (Table.StorageDescriptor.OutputFormat)
- SerializationLib (Table.StorageDescriptor.SerdeInfo.SerializationLibrary)

The only exception is Iceberg tables, which use the TableType (Table.Parameters["table_type"]) field.
For example, an AWS Glue table with the following fields is mapped to an ORC table in BigQuery:

- InputFormat = "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"
- OutputFormat = "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"
- SerializationLib = "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
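The detection rule above can be sketched as a small function. This is our own illustration covering only the cases discussed in this section (the Iceberg exception and the ORC serde), not BigQuery's actual implementation:

```python
def detect_table_format(table):
    """Sketch of the format-detection rule described above.

    Checks the Iceberg exception first, then matches on the serde library.
    Only the cases discussed in this section are covered; this is an
    illustration, not BigQuery's real logic.
    """
    # Iceberg tables are identified by Table.Parameters["table_type"].
    if table.get("Parameters", {}).get("table_type", "").upper() == "ICEBERG":
        return "ICEBERG"
    # Otherwise the storage descriptor's serde library decides the format.
    serde = (table.get("StorageDescriptor", {})
                  .get("SerdeInfo", {})
                  .get("SerializationLibrary", ""))
    if serde == "org.apache.hadoop.hive.ql.io.orc.OrcSerde":
        return "ORC"
    return "UNKNOWN"


orc_table = {
    "StorageDescriptor": {
        "InputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
        },
    }
}
print(detect_table_format(orc_table))  # ORC
```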
Location
The location of each BigQuery table is determined by the following:

- Iceberg tables: the Table.Parameters["metadata_location"] field in the AWS Glue table
- Non-Iceberg unpartitioned tables: the Table.StorageDescriptor.Location field in the AWS Glue table
- Non-Iceberg partitioned tables: the AWS Glue GetPartitions API
Other properties
Additionally, some AWS Glue table properties are automatically mapped to format-specific options in BigQuery:
| Format | SerializationLib | AWS Glue table value | BigQuery option |
|---|---|---|---|
| CSV | LazySimpleSerDe | Table.StorageDescriptor.SerdeInfo.Parameters["field.delim"] | CsvOptions.fieldDelimiter |
| CSV | LazySimpleSerDe | Table.StorageDescriptor.Parameters["serialization.encoding"] | CsvOptions.encoding |
| CSV | LazySimpleSerDe | Table.StorageDescriptor.Parameters["skip.header.line.count"] | CsvOptions.skipLeadingRows |
| CSV | OpenCsvSerDe | Table.StorageDescriptor.SerdeInfo.Parameters["separatorChar"] | CsvOptions.fieldDelimiter |
| CSV | OpenCsvSerDe | Table.StorageDescriptor.SerdeInfo.Parameters["quoteChar"] | CsvOptions.quote |
| CSV | OpenCsvSerDe | Table.StorageDescriptor.Parameters["serialization.encoding"] | CsvOptions.encoding |
| CSV | OpenCsvSerDe | Table.StorageDescriptor.Parameters["skip.header.line.count"] | CsvOptions.skipLeadingRows |
| JSON | HiveJsonSerDe | Table.StorageDescriptor.Parameters["serialization.encoding"] | JsonOptions.encoding |
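As an illustration of the table above, the following Python sketch maps a LazySimpleSerDe CSV table's parameters to the corresponding CsvOptions fields. It is our own summary of the mapping, not an official API:

```python
def csv_options_from_glue(table):
    """Sketch of the LazySimpleSerDe rows of the mapping table above.

    Illustration only -- not an official BigQuery or AWS API. Note that
    field.delim lives under SerdeInfo.Parameters, while the encoding and
    header-skip settings live under StorageDescriptor.Parameters.
    """
    sd = table.get("StorageDescriptor", {})
    serde_params = sd.get("SerdeInfo", {}).get("Parameters", {})
    table_params = sd.get("Parameters", {})
    options = {}
    if "field.delim" in serde_params:
        options["fieldDelimiter"] = serde_params["field.delim"]
    if "serialization.encoding" in table_params:
        options["encoding"] = table_params["serialization.encoding"]
    if "skip.header.line.count" in table_params:
        options["skipLeadingRows"] = int(table_params["skip.header.line.count"])
    return options


glue_table = {
    "StorageDescriptor": {
        "SerdeInfo": {"Parameters": {"field.delim": "\t"}},
        "Parameters": {"skip.header.line.count": "1"},
    }
}
print(csv_options_from_glue(glue_table))  # {'fieldDelimiter': '\t', 'skipLeadingRows': 1}
```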
Create a view in a federated dataset
You can't create a view in a federated dataset. However, you can create a view in a standard dataset that's based on a table in a federated dataset. For more information, see Create views.
Delete a federated dataset
Deleting a federated dataset is the same as deleting any other BigQuery dataset. For more information, see Delete datasets.
Pricing
For information about pricing, see BigQuery Omni pricing.
Limitations
- All BigQuery Omni limitations apply.
- You can't add, delete, or update data or metadata in tables in an AWS Glue federated dataset.
- You can't create new tables, views, or materialized views in an AWS Glue federated dataset.
- INFORMATION_SCHEMA views aren't supported.
- Metadata caching isn't supported.
- Dataset-level settings that are related to table creation defaults don't affect federated datasets because you can't create tables manually.
- The Apache Hive data type UNION isn't supported for Avro tables.
- External table limitations apply.
What's next
- Learn more about BigQuery Omni.
- Try the BigQuery Omni with AWS lab.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.