Introduction to Amazon S3 transfers

The BigQuery Data Transfer Service for Amazon S3 lets you automatically schedule and manage recurring load jobs from Amazon S3 into BigQuery.

Supported file formats

The BigQuery Data Transfer Service supports loading data from Amazon S3 in one of the following formats:

  • Comma-separated values (CSV)
  • JSON (newline-delimited)
  • Avro
  • Parquet
  • ORC

Supported compression types

The BigQuery Data Transfer Service for Amazon S3 supports loading compressed data. The compression types supported by the BigQuery Data Transfer Service are the same as the compression types supported by BigQuery load jobs. For more information, see Loading compressed and uncompressed data.

Amazon S3 prerequisites

To load data from an Amazon S3 data source, you must:

  • Provide the Amazon S3 URI for your source data
  • Have your access key ID
  • Have your secret access key
  • Set, at a minimum, the AWS managed policy AmazonS3ReadOnlyAccess on your Amazon S3 source data (see the configuration sketch after this list)
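
To illustrate how these pieces fit together, the following minimal Python sketch uses the google-cloud-bigquery-datatransfer client library to create an Amazon S3 transfer configuration. The project, dataset, table, bucket, and key values are placeholders, and the parameter names reflect how Amazon S3 transfers are commonly configured; verify them against the transfer setup guide before use.

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder values; replace with your own project, dataset, and AWS details.
project_id = "my-project"

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="My Amazon S3 transfer",
    data_source_id="amazon_s3",
    schedule="every 24 hours",
    params={
        "data_path": "s3://my-bucket/my-folder/*.csv",   # Amazon S3 URI
        "destination_table_name_template": "my_table",
        "file_format": "CSV",
        "access_key_id": "YOUR_ACCESS_KEY_ID",
        "secret_access_key": "YOUR_SECRET_ACCESS_KEY",
    },
)

transfer_config = transfer_client.create_transfer_config(
    bigquery_datatransfer.CreateTransferConfigRequest(
        parent=transfer_client.common_project_path(project_id),
        transfer_config=transfer_config,
    )
)
print(f"Created transfer config: {transfer_config.name}")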

Amazon S3 URIs

When you supply the Amazon S3 URI, the path must be in the following format: s3://bucket/folder1/folder2/... Only the top-level bucket name is required. Folder names are optional. If you specify a URI that includes only the bucket name, all files in the bucket are transferred and loaded into BigQuery.

Amazon S3 transfer runtime parameterization

The Amazon S3 URI and the destination table can both be parameterized, allowing you to load data from Amazon S3 buckets organized by date. Note that the bucket portion of the URI cannot be parameterized. The parameters used by Amazon S3 transfers are the same as those used by Cloud Storage transfers.

For details, see Runtime parameters in transfers.
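
As a brief illustration, assuming the standard run_date parameter described on that page, a daily transfer run on 2023-08-09 could use a parameterized data path and destination table such as

s3://my-bucket/events/{run_date}/*.csv
my_table${run_date}

which would resolve, for that run, to s3://my-bucket/events/20230809/*.csv and to the 20230809 partition of my_table ({run_date} formats as YYYYMMDD; see Runtime parameters in transfers for the complete syntax and the {run_time} variants).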

Data ingestion for Amazon S3 transfers

You can specify how data is loaded into BigQuery by selecting a Write Preference in the transfer configuration when you set up an Amazon S3 transfer.

There are two types of write preferences available: incremental transfers and truncated transfers.

Incremental transfers

A transfer configuration with an APPEND or WRITE_APPEND write preference, also called an incremental transfer, incrementally appends new data since the previous successful transfer to a BigQuery destination table. When a transfer configuration runs with an APPEND write preference, the BigQuery Data Transfer Service filters for files that have been modified since the previous successful transfer run. To determine when a file was modified, the BigQuery Data Transfer Service looks at the file metadata for a "last modified time" property. For example, the BigQuery Data Transfer Service looks at the updated timestamp property in a Cloud Storage file. If the BigQuery Data Transfer Service finds any files with a "last modified time" later than the timestamp of the last successful transfer, the BigQuery Data Transfer Service transfers those files in an incremental transfer.
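
Conceptually, the selection logic resembles the following Python sketch. This is an illustration only, not the service's actual implementation; the file metadata mirrors the example that follows.

from datetime import datetime, timezone

def files_to_transfer(files, last_successful_run):
    """Return the files whose "last modified time" is later than the
    previous successful transfer run (illustration only)."""
    return [f for f in files if f["last_modified"] > last_successful_run]

# Hypothetical file metadata, mirroring the example below.
files = [
    {"name": "file_1", "last_modified": datetime(2023, 7, 1, tzinfo=timezone.utc)},
    {"name": "file_2", "last_modified": datetime(2023, 7, 3, tzinfo=timezone.utc)},
]
last_successful_run = datetime(2023, 7, 1, 3, tzinfo=timezone.utc)

print([f["name"] for f in files_to_transfer(files, last_successful_run)])
# ['file_2']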

To demonstrate how incremental transfers work, consider the following Cloud Storage transfer example. A user creates a file in a Cloud Storage bucket at time 2023-07-01T00:00Z named file_1. The updated timestamp for file_1 is the time that the file was created. The user then creates an incremental transfer from the Cloud Storage bucket, scheduled to run once daily at 03:00Z, starting from 2023-07-01T03:00Z.

  • At 2023-07-01T03:00Z, the first transfer run starts. As this is the first transfer run for this configuration, the BigQuery Data Transfer Service attempts to load all files matching the source URI into the destination BigQuery table. The transfer run succeeds and the BigQuery Data Transfer Service successfully loads file_1 into the destination BigQuery table.
  • The next transfer run, at 2023-07-02T03:00Z, detects no files where the updated timestamp property is greater than the last successful transfer run (2023-07-01T03:00Z). The transfer run succeeds without loading any additional data into the destination BigQuery table.

The preceding example shows how the BigQuery Data Transfer Service looks at the updated timestamp property of the source file to determine if any changes were made to the source files, and to transfer those changes if any were detected.

Following the same example, suppose that the user then creates another file in the Cloud Storage bucket at time 2023-07-03T00:00Z, named file_2. The updated timestamp for file_2 is the time that the file was created.

  • The next transfer run, at 2023-07-03T03:00Z, detects that file_2 has an updated timestamp greater than the last successful transfer run (2023-07-01T03:00Z). Suppose that when the transfer run starts, it fails due to a transient error. In this scenario, file_2 is not loaded into the destination BigQuery table. The last successful transfer run timestamp remains at 2023-07-01T03:00Z.
  • The next transfer run, at 2023-07-04T03:00Z, detects that file_2 has an updated timestamp greater than the last successful transfer run (2023-07-01T03:00Z). This time, the transfer run completes without issue, so it successfully loads file_2 into the destination BigQuery table.
  • The next transfer run, at 2023-07-05T03:00Z, detects no files where the updated timestamp is greater than the last successful transfer run (2023-07-04T03:00Z). The transfer run succeeds without loading any additional data into the destination BigQuery table.

The preceding example shows that when a transfer fails, no files are transferred to the BigQuery destination table. Any file changes are transferred at the next successful transfer run. Subsequent successful transfers following a failed transfer do not cause duplicate data. In the case of a failed transfer, you can also choose to manually trigger a transfer outside its regularly scheduled time.
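
If you need to re-run a window after a failure, one way to manually trigger runs is with the start_manual_transfer_runs method of the Python client library, as in this sketch; the transfer configuration name and the requested time range are placeholders.

import datetime
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder: the full resource name of your transfer configuration.
transfer_config_name = "projects/1234/locations/us/transferConfigs/abcd"

# Request runs covering the last two days (adjust to the window you need).
now = datetime.datetime.now(datetime.timezone.utc)
response = transfer_client.start_manual_transfer_runs(
    bigquery_datatransfer.StartManualTransferRunsRequest(
        parent=transfer_config_name,
        requested_time_range={
            "start_time": now - datetime.timedelta(days=2),
            "end_time": now,
        },
    )
)
for run in response.runs:
    print(f"Started transfer run: {run.name}")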

Warning: BigQuery Data Transfer Service relies on the "last modified time" property in each source file to determine which files to transfer, as seen in the incremental transfer examples. Modifying these properties can cause the transfer to skip certain files, or load the same file multiple times. This property can have different names in each storage system supported by BigQuery Data Transfer Service. For example, Cloud Storage objects call this property updated.

Truncated transfers

A transfer configuration with a MIRROR or WRITE_TRUNCATE write preference, also called a truncated transfer, overwrites data in the BigQuery destination table during each transfer run with data from all files matching the source URI. MIRROR writes a fresh copy of data to the destination table. If the destination table is using a partition decorator, the transfer run only overwrites data in the specified partition. A destination table with a partition decorator has the format my_table${run_date}—for example, my_table$20230809.

Repeating the same incremental or truncated transfers in a day does not cause duplicate data. However, if you run multiple different transfer configurations that affect the same BigQuery destination table, this can cause the BigQuery Data Transfer Service to duplicate data.

Wildcard support for Amazon S3 URIs

If your source data is separated into multiple files that share a common base name, you can use a wildcard in the URI when you load the data. A wildcard consists of an asterisk (*), and can be used anywhere in the Amazon S3 URI except for the bucket name.

While more than one wildcard can be used in the Amazon S3 URI, some optimization is possible when the Amazon S3 URI specifies only a single wildcard:

  • There is a higher limit on the maximum number of files per transfer run.

  • The wildcard will span directory boundaries. For example, the Amazon S3 URI s3://my-bucket/*.csv will match the file s3://my-bucket/my-folder/my-subfolder/my-file.csv.

Amazon S3 URI examples

Example 1

To load a single file from Amazon S3 into BigQuery, specify the Amazon S3 URI of the file.

s3://my-bucket/my-folder/my-file.csv

Example 2

To load all files from an Amazon S3 bucket into BigQuery, specify only the bucket name, with or without a wildcard.

s3://my-bucket/

or

s3://my-bucket/*

Note that s3://my-bucket* is not a permitted Amazon S3 URI, as a wildcard can't be used in the bucket name.

Example 3

To load all files from Amazon S3 that share a common prefix, specify the common prefix followed by a wildcard.

s3://my-bucket/my-folder/*

Note that, in contrast to loading all files from a top-level Amazon S3 bucket, the wildcard must be specified at the end of the Amazon S3 URI for any files to be loaded.

Example 4

To load all files from Amazon S3 with a similar path, specify the common prefix followed by a wildcard.

s3://my-bucket/my-folder/*.csv

Note that wildcards span directories, so any CSV files in my-folder, as well as in subfolders of my-folder, are loaded into BigQuery.

Example 5

If you have these source files under a logs folder:

s3://my-bucket/logs/logs.csv
s3://my-bucket/logs/system/logs.csv
s3://my-bucket/logs/some-application/system_logs.log
s3://my-bucket/logs/logs_2019_12_12.csv

then the following identifies them:

s3://my-bucket/logs/*

Example 6

If you have these source files, but want to transfer only those that have logs.csv as the filename:

s3://my-bucket/logs.csv
s3://my-bucket/metadata.csv
s3://my-bucket/system/logs.csv
s3://my-bucket/system/users.csv
s3://my-bucket/some-application/logs.csv
s3://my-bucket/some-application/output.csv

then the following identifies the files with logs.csv in the name:

s3://my-bucket/*logs.csv

Example 7

Using multiple wildcards gives you more control over which files are transferred, at the cost of lower limits. Each wildcard only matches up to the end of a path within a subdirectory. For example, consider the following source files in Amazon S3:

s3://my-bucket/my-folder1/my-file1.csv
s3://my-bucket/my-other-folder2/my-file2.csv
s3://my-bucket/my-folder1/my-subfolder/my-file3.csv
s3://my-bucket/my-other-folder2/my-subfolder/my-file4.csv

If the intention is to transfer only my-file1.csv and my-file2.csv, use the following as the value for the Amazon S3 URI:

s3://my-bucket/*/*.csv

As neither wildcard spans directories, this URI would limit the transfer to only the CSV files that are in my-folder1 and my-other-folder2. Subfolders would not be included in the transfer.

AWS access keys

The access key ID and secret access key are used to access the Amazon S3 data on your behalf. As a best practice, create a unique access key ID and secret access key specifically for Amazon S3 transfers to give minimal access to the BigQuery Data Transfer Service. For information on managing your access keys, see the AWS general reference documentation.
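
For example, one way to create a dedicated user and key pair with boto3 (assuming AWS credentials with sufficient IAM permissions are already configured, and using a hypothetical user name) is shown in this sketch.

import boto3

iam = boto3.client("iam")
user_name = "bq-dts-s3-transfer"  # hypothetical user name for illustration

# Create a dedicated IAM user for the transfer.
iam.create_user(UserName=user_name)

# Grant read-only access to Amazon S3 (or attach a narrower,
# bucket-scoped policy of your own instead).
iam.attach_user_policy(
    UserName=user_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

# Generate the access key ID and secret access key to use in the transfer
# configuration. Store the secret securely; it cannot be retrieved later.
access_key = iam.create_access_key(UserName=user_name)["AccessKey"]
print(access_key["AccessKeyId"])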

IP restrictions

If you use IP restrictions for access to Amazon S3, you must add the IP ranges used by BigQuery Data Transfer Service workers to your list of allowed IPs.

To add IP ranges as allowed public IP addresses to Amazon S3, see IP restrictions.

Consistency considerations

When you transfer data from Amazon S3, it is possible that some of your data won't be transferred to BigQuery, particularly if the files were added to the bucket very recently. It should take approximately 5 minutes for a file to become available to the BigQuery Data Transfer Service after it is added to the bucket.

Important: To reduce the possibility of missing data, schedule your Amazon S3 transfers to occur at least 5 minutes after your files are added to the bucket.

Outbound data transfer costs best practice

Transfers from Amazon S3 could fail if the destination table has not been configured properly. Reasons that could result in an improper configuration include:

  • The destination table does not exist.
  • The table schema is not defined.
  • The table schema is not compatible with the data being transferred.

To avoid Amazon S3 outbound data transfer costs, you should first test a transfer with a small but representative subset of the files. Small means the test should have both a small data size and a small file count.

Pricing

For information on BigQuery Data Transfer Service pricing, see the Pricing page.

Note that costs can be incurred outside of Google by using this service. Review the Amazon S3 pricing page for details.

Quotas and limits

The BigQuery Data Transfer Service uses load jobs to load Amazon S3 data into BigQuery. All BigQuery quotas and limits on load jobs apply to recurring Amazon S3 transfers, with the following additional considerations:

Value | Limit
Maximum size per load job transfer run | 15 TB
Maximum number of files per transfer run when the Amazon S3 URI includes 0 or 1 wildcards | 10,000,000 files
Maximum number of files per transfer run when the Amazon S3 URI includes more than 1 wildcard | 10,000 files
