Create AWS Glue federated datasets

This document describes how to create a federated dataset in BigQuery that's linked to an existing database in AWS Glue.

A federated dataset is a connection between BigQuery and an external data source at the dataset level. The tables in a federated dataset are automatically populated from the tables in the corresponding external data source. You can query these tables directly in BigQuery, but you cannot make modifications, additions, or deletions. However, any updates that you make in the external data source are automatically reflected in BigQuery.

Before you begin

Ensure that you have a connection to access AWS Glue data.

  • To create or modify a connection, follow the instructions in Connect to Amazon S3. When you create that connection, include the following policy statement for AWS Glue in your AWS Identity and Access Management policy for BigQuery. Include this statement in addition to the other permissions on the Amazon S3 bucket where the data in your AWS Glue tables is stored.

    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartitions"
      ],
      "Resource": [
        "arn:aws:glue:REGION:ACCOUNT_ID:catalog",
        "arn:aws:glue:REGION:ACCOUNT_ID:database/DATABASE_NAME",
        "arn:aws:glue:REGION:ACCOUNT_ID:table/DATABASE_NAME/*"
      ]
    }

    Replace the following:

    • REGION: the AWS region, for example, us-east-1
    • ACCOUNT_ID: the 12-digit AWS account ID
    • DATABASE_NAME: the AWS Glue database name
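
As an illustration of the placeholder substitution above, here is a minimal Python sketch that fills in the policy statement for one database. The function name glue_policy_statement and the example values are hypothetical; only the actions and ARN patterns come from the statement shown above.

```python
import json


def glue_policy_statement(region: str, account_id: str, database_name: str) -> dict:
    """Build the AWS Glue policy statement for a single Glue database."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Effect": "Allow",
        "Action": [
            "glue:GetDatabase",
            "glue:GetTable",
            "glue:GetTables",
            "glue:GetPartitions",
        ],
        "Resource": [
            f"{prefix}:catalog",
            f"{prefix}:database/{database_name}",
            f"{prefix}:table/{database_name}/*",
        ],
    }


# Placeholder values for illustration only, not real identifiers.
statement = glue_policy_statement("us-east-1", "123456789012", "test_database")
print(json.dumps(statement, indent=2))
```

The printed JSON can be pasted into the Statement list of the IAM policy, next to the Amazon S3 permissions.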

Required permissions

To get the permissions that you need to create a federated dataset, ask your administrator to grant you the BigQuery Admin (roles/bigquery.admin) IAM role. For more information about granting roles, see Manage access to projects, folders, and organizations.

This predefined role contains the permissions required to create a federated dataset. The exact permissions are as follows:

  • bigquery.datasets.create
  • bigquery.connections.use
  • bigquery.connections.delegate

You might also be able to get these permissions with custom roles or other predefined roles.

For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.

Create a federated dataset

To create a federated dataset, do the following:

Console

  1. Open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the left pane, click Explorer.

    If you don't see the left pane, click Expand left pane to open the pane.

  3. In the Explorer pane, select the project where you want to create the dataset.

  4. Click View actions, and then click Create dataset.

  5. On the Create dataset page, do the following:

    • For Dataset ID, enter a unique dataset name.
    • For Location type, choose an AWS location for the dataset, such as aws-us-east-1. After you create a dataset, the location can't be changed.
    • For External Dataset, do the following:

      • Check the box next to Link to an external dataset.
      • For External dataset type, select AWS Glue.
      • For External source, enter aws-glue:// followed by the Amazon Resource Name (ARN) of the AWS Glue database, for example, aws-glue://arn:aws:glue:us-east-1:123456789:database/test_database.
      • For Connection ID, select your AWS connection.
    • Leave the other default settings as they are.

  6. Click Create dataset.

SQL

Use the CREATE EXTERNAL SCHEMA data definition language (DDL) statement.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL SCHEMA DATASET_NAME
    WITH CONNECTION PROJECT_ID.CONNECTION_LOCATION.CONNECTION_NAME
    OPTIONS (
      external_source = 'AWS_GLUE_SOURCE',
      location = 'LOCATION');

    Replace the following:

    • DATASET_NAME: the name of your new dataset in BigQuery.
    • PROJECT_ID: your project ID.
    • CONNECTION_LOCATION: the location of your AWS connection, for example, aws-us-east-1.
    • CONNECTION_NAME: the name of your AWS connection.
    • AWS_GLUE_SOURCE: the Amazon Resource Name (ARN) of the AWS Glue database with a prefix identifying the source, for example, aws-glue://arn:aws:glue:us-east-1:123456789:database/test_database.
    • LOCATION: the location of your new dataset in BigQuery, for example, aws-us-east-1. After you create a dataset, you can't change its location.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.
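
To make the placeholder substitution concrete, the following Python sketch assembles the statement from its parts. The function name and example values are hypothetical. One assumption goes beyond the template above: the connection path is wrapped in backticks so that components containing hyphens (such as aws-us-east-1) parse as a single quoted path. Check that quoting against the DDL reference before relying on it.

```python
def create_external_schema_sql(
    dataset_name: str,
    project_id: str,
    connection_location: str,
    connection_name: str,
    glue_database_arn: str,
    location: str,
) -> str:
    """Assemble a CREATE EXTERNAL SCHEMA statement for an AWS Glue database."""
    # The external source is the database ARN with the aws-glue:// prefix.
    external_source = f"aws-glue://{glue_database_arn}"
    # Assumption: backtick-quote the connection path because it can contain hyphens.
    connection = f"`{project_id}.{connection_location}.{connection_name}`"
    return (
        f"CREATE EXTERNAL SCHEMA {dataset_name}\n"
        f"WITH CONNECTION {connection}\n"
        "OPTIONS (\n"
        f"  external_source = '{external_source}',\n"
        f"  location = '{location}');"
    )


# Placeholder values for illustration only.
sql = create_external_schema_sql(
    dataset_name="my_glue_dataset",
    project_id="my-project",
    connection_location="aws-us-east-1",
    connection_name="my_aws_connection",
    glue_database_arn="arn:aws:glue:us-east-1:123456789:database/test_database",
    location="aws-us-east-1",
)
print(sql)
```

The sketch does no escaping of its inputs, so it is only suitable for trusted, hand-entered values.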

bq

In a command-line environment, create a dataset by using the bq mk command:

bq --location=LOCATION mk --dataset \
    --external_source aws-glue://AWS_GLUE_SOURCE \
    --connection_id PROJECT_ID.CONNECTION_LOCATION.CONNECTION_NAME \
    DATASET_NAME

Replace the following:

  • LOCATION: the location of your new dataset in BigQuery, for example, aws-us-east-1. After you create a dataset, you can't change its location. You can set a default location value by using the .bigqueryrc file.
  • AWS_GLUE_SOURCE: the Amazon Resource Name (ARN) of the AWS Glue database, for example, arn:aws:glue:us-east-1:123456789:database/test_database.
  • PROJECT_ID: your BigQuery project ID.
  • CONNECTION_LOCATION: the location of your AWS connection, for example, aws-us-east-1.
  • CONNECTION_NAME: the name of your AWS connection.
  • DATASET_NAME: the name of your new dataset in BigQuery. To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_NAME.

Terraform

Use the google_bigquery_dataset resource.

Note: To create BigQuery objects using Terraform, you must enable the Cloud Resource Manager API.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

The following example creates an AWS Glue federated dataset:

resource "google_bigquery_dataset" "dataset" {
  provider      = google-beta
  dataset_id    = "example_dataset"
  friendly_name = "test"
  description   = "This is a test description."
  location      = "aws-us-east-1"

  external_dataset_reference {
    external_source = "aws-glue://arn:aws:glue:us-east-1:999999999999:database/database"
    connection      = "projects/project/locations/aws-us-east-1/connections/connection"
  }
}

To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.

Prepare Cloud Shell

  1. Launch Cloud Shell.
  2. Set the default Google Cloud project where you want to apply your Terraform configurations.

    You only need to run this command once per project, and you can run it in any directory.

    export GOOGLE_CLOUD_PROJECT=PROJECT_ID

    Environment variables are overridden if you set explicit values in the Terraform configuration file.

Prepare the directory

Each Terraform configuration file must have its own directory (also called a root module).

  1. In Cloud Shell, create a directory and a new file within that directory. The filename must have the .tf extension, for example, main.tf. In this tutorial, the file is referred to as main.tf.
    mkdir DIRECTORY && cd DIRECTORY && touch main.tf
  2. If you are following a tutorial, you can copy the sample code in each section or step.

    Copy the sample code into the newly created main.tf.

    Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution.

  3. Review and modify the sample parameters to apply to your environment.
  4. Save your changes.
  5. Initialize Terraform. You only need to do this once per directory.
    terraform init

    Optionally, to use the latest Google provider version, include the -upgrade option:

    terraform init -upgrade

Apply the changes

  1. Review the configuration and verify that the resources that Terraform is going to create or update match your expectations:
    terraform plan

    Make corrections to the configuration as necessary.

  2. Apply the Terraform configuration by running the following command and entering yes at the prompt:
    terraform apply

    Wait until Terraform displays the "Apply complete!" message.

  3. Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.
Note: Terraform samples typically assume that the required APIs are enabled in your Google Cloud project.

API

Call the datasets.insert method with a defined dataset resource and externalDatasetReference field for your AWS Glue database.
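
As a rough sketch of what the request body for datasets.insert might look like, the following Python builds the dataset resource as a dictionary. The function name and example values are hypothetical, and the connection path format is assumed to match the one in the Terraform example (projects/.../locations/.../connections/...). Sending the request, including authentication, is left out.

```python
import json


def federated_dataset_body(
    project_id: str,
    dataset_id: str,
    location: str,
    glue_database_arn: str,
    connection_location: str,
    connection_name: str,
) -> dict:
    """Build a dataset resource that links to an AWS Glue database."""
    return {
        "datasetReference": {"projectId": project_id, "datasetId": dataset_id},
        "location": location,
        "externalDatasetReference": {
            # The Glue database ARN with the aws-glue:// prefix.
            "externalSource": f"aws-glue://{glue_database_arn}",
            # Assumed to use the same path format as the Terraform example.
            "connection": (
                f"projects/{project_id}/locations/{connection_location}"
                f"/connections/{connection_name}"
            ),
        },
    }


# Placeholder values for illustration only.
body = federated_dataset_body(
    project_id="my-project",
    dataset_id="example_dataset",
    location="aws-us-east-1",
    glue_database_arn="arn:aws:glue:us-east-1:999999999999:database/database",
    connection_location="aws-us-east-1",
    connection_name="my_aws_connection",
)
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of an authenticated POST request to the datasets collection of the target project.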

List tables in a federated dataset

To list the tables that are available for query in your federated dataset, see Listing datasets.

Get table information

To get information on the tables in your federated dataset, such as schema details, see Get table information.

Control access to tables

To manage access to the tables in your federated dataset, see Control access to resources with IAM.

Row-level security, column-level security, and data masking are also supported for tables in federated datasets.

Schema operations that might invalidate security policies, such as deleting a column in AWS Glue, can cause jobs to fail until the policies are updated. Additionally, if you delete a table in AWS Glue and recreate it, your security policies no longer apply to the recreated table.

Query AWS Glue data

Querying tables in federated datasets is the same as querying tables in any other BigQuery dataset.

You can query AWS Glue tables in the following formats:

  • CSV (compressed and uncompressed)
  • JSON (compressed and uncompressed)
  • Parquet
  • ORC
  • Avro
  • Iceberg
  • Delta Lake

Table mapping details

Every table that you grant access to in your AWS Glue database appears as an equivalent table in your BigQuery dataset.

Format

The format of each BigQuery table is determined by the following fields of the respective AWS Glue table:

  • InputFormat (Table.StorageDescriptor.InputFormat)
  • OutputFormat (Table.StorageDescriptor.OutputFormat)
  • SerializationLib (Table.StorageDescriptor.SerdeInfo.SerializationLibrary)

The only exception is Iceberg tables, which use the TableType (Table.Parameters["table_type"]) field.

For example, an AWS Glue table with the following fields is mapped to an ORC table in BigQuery:

  • InputFormat = "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"
  • OutputFormat = "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"
  • SerializationLib = "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
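
The rule above can be sketched in Python against a Glue table description shaped like the Table fields named in this section. This is an illustration only, not BigQuery's actual detection logic: it covers just the ORC example and the Iceberg exception, and it assumes that Iceberg tables carry the value "ICEBERG" in Table.Parameters["table_type"]. The function name detect_format is hypothetical.

```python
from typing import Optional

# The three ORC class names from the example above.
ORC_FIELDS = (
    "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
    "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
    "org.apache.hadoop.hive.ql.io.orc.OrcSerde",
)


def detect_format(table: dict) -> Optional[str]:
    """Return "ICEBERG", "ORC", or None for formats this sketch doesn't cover."""
    # Iceberg is the exception: it is identified by a table parameter.
    # Assumption: the parameter value is "ICEBERG" (compared case-insensitively).
    params = table.get("Parameters", {})
    if params.get("table_type", "").upper() == "ICEBERG":
        return "ICEBERG"

    # Other formats are identified by the three storage descriptor fields.
    descriptor = table.get("StorageDescriptor", {})
    fields = (
        descriptor.get("InputFormat"),
        descriptor.get("OutputFormat"),
        descriptor.get("SerdeInfo", {}).get("SerializationLibrary"),
    )
    if fields == ORC_FIELDS:
        return "ORC"
    return None


orc_table = {
    "StorageDescriptor": {
        "InputFormat": ORC_FIELDS[0],
        "OutputFormat": ORC_FIELDS[1],
        "SerdeInfo": {"SerializationLibrary": ORC_FIELDS[2]},
    }
}
iceberg_table = {"Parameters": {"table_type": "ICEBERG"}}

print(detect_format(orc_table))
print(detect_format(iceberg_table))
```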

Location

The location of each BigQuery table is determined by the following:

  • Iceberg tables: the Table.Parameters["metadata_location"] field in the AWS Glue table
  • Non-Iceberg unpartitioned tables: the Table.StorageDescriptor.Location field in the AWS Glue table
  • Non-Iceberg partitioned tables: the AWS Glue GetPartitions API

Other properties

Additionally, some AWS Glue table properties are automatically mapped to format-specific options in BigQuery:

Format | SerializationLib | AWS Glue table value | BigQuery option
CSV | LazySimpleSerDe | Table.StorageDescriptor.SerdeInfo.Parameters["field.delim"] | CsvOptions.fieldDelimiter
CSV | LazySimpleSerDe | Table.StorageDescriptor.Parameters["serialization.encoding"] | CsvOptions.encoding
CSV | LazySimpleSerDe | Table.StorageDescriptor.Parameters["skip.header.line.count"] | CsvOptions.skipLeadingRows
CSV | OpenCsvSerDe | Table.StorageDescriptor.SerdeInfo.Parameters["separatorChar"] | CsvOptions.fieldDelimiter
CSV | OpenCsvSerDe | Table.StorageDescriptor.SerdeInfo.Parameters["quoteChar"] | CsvOptions.quote
CSV | OpenCsvSerDe | Table.StorageDescriptor.Parameters["serialization.encoding"] | CsvOptions.encoding
CSV | OpenCsvSerDe | Table.StorageDescriptor.Parameters["skip.header.line.count"] | CsvOptions.skipLeadingRows
JSON | HiveJsonSerDe | Table.StorageDescriptor.Parameters["serialization.encoding"] | JsonOptions.encoding
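
The mapping table can also be read as a lookup, sketched here in Python. The dictionary is transcribed from the table above. The function name map_options and its parameters are hypothetical, and values are passed through unchanged because this page doesn't describe any conversion that BigQuery applies.

```python
# Where each AWS Glue property is read from.
SERDE = "SerdeInfo.Parameters"
STORAGE = "StorageDescriptor.Parameters"

# SerDe short name -> {(source, Glue key): BigQuery option}.
OPTION_MAP = {
    "LazySimpleSerDe": {
        (SERDE, "field.delim"): "CsvOptions.fieldDelimiter",
        (STORAGE, "serialization.encoding"): "CsvOptions.encoding",
        (STORAGE, "skip.header.line.count"): "CsvOptions.skipLeadingRows",
    },
    "OpenCsvSerDe": {
        (SERDE, "separatorChar"): "CsvOptions.fieldDelimiter",
        (SERDE, "quoteChar"): "CsvOptions.quote",
        (STORAGE, "serialization.encoding"): "CsvOptions.encoding",
        (STORAGE, "skip.header.line.count"): "CsvOptions.skipLeadingRows",
    },
    "HiveJsonSerDe": {
        (STORAGE, "serialization.encoding"): "JsonOptions.encoding",
    },
}


def map_options(serde_name: str, serde_params: dict, storage_params: dict) -> dict:
    """Return the BigQuery options that correspond to the Glue properties that are set."""
    sources = {SERDE: serde_params, STORAGE: storage_params}
    options = {}
    for (source, key), option in OPTION_MAP.get(serde_name, {}).items():
        if key in sources[source]:
            # Values are copied as-is; any type conversion is out of scope here.
            options[option] = sources[source][key]
    return options


mapped = map_options(
    "OpenCsvSerDe",
    serde_params={"separatorChar": ";", "quoteChar": "'"},
    storage_params={"skip.header.line.count": "1"},
)
print(mapped)
```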

Create a view in a federated dataset

You can't create a view in a federated dataset. However, you can create a view in a standard dataset that's based on a table in a federated dataset. For more information, see Create views.

Delete a federated dataset

Deleting a federated dataset is the same as deleting any other BigQuery dataset. For more information, see Delete datasets.

Pricing

For information about pricing, see BigQuery Omni pricing.

Limitations

  • All BigQuery Omni limitations apply.
  • You can't add, delete, or update data or metadata in tables in an AWS Glue federated dataset.
  • You can't create new tables, views, or materialized views in an AWS Glue federated dataset.
  • INFORMATION_SCHEMA views aren't supported.
  • Metadata caching isn't supported.
  • Dataset-level settings that are related to table creation defaults don't affect federated datasets because you can't create tables manually.
  • The Apache Hive data type UNION isn't supported for Avro tables.
  • External table limitations apply.


Last updated 2025-12-15 UTC.