Load Avro data from Cloud Storage

Avro is an open source data format that bundles serialized data with the data's schema in the same file.

When you load Avro data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).

When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.

For information about loading Avro data from a local file, see Loading data into BigQuery from a local data source.

Limitations

You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:

  • BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
  • BigQuery doesn't support Cloud Storage object versioning. If you include a generation number in the Cloud Storage URI, then the load job fails.

The following limitations also apply when loading Avro files into BigQuery:

  • BigQuery doesn't support loading standalone Avro schema (.avsc) files.
  • Nested array formatting isn't supported in BigQuery. Avro files that use this format must be converted before importing.
  • In an Avro file, names and namespaces for a fullname can only contain alphanumeric characters and the underscore character _. The following regular expression shows the allowed characters: [A-Za-z_][A-Za-z0-9_]*.
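
If you want to check names before loading, a quick client-side validation is possible. The following is a minimal Python sketch, not part of any official tooling; the is_valid_fullname helper is hypothetical and simply mirrors the regular expression above, applied to each dot-separated part of a fullname.

import re

# Mirrors the allowed-character rule quoted above for Avro names and namespaces.
FULLNAME_PART = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_valid_fullname(fullname: str) -> bool:
    """Return True if every dot-separated part of an Avro fullname is allowed."""
    return all(FULLNAME_PART.match(part) for part in fullname.split("."))

print(is_valid_fullname("my_namespace.my_record"))  # True
print(is_valid_fullname("my-namespace.my_record"))  # False: hyphen is not allowed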

For information about BigQuery load job limits, see Load jobs.

Input file requirements

To avoid resourcesExceeded errors when loading Avro files into BigQuery, follow these guidelines:

  • Keep row sizes to 50 MB or less.
  • If the row contains many array fields, or any very long array fields, break the array values into separate fields.

Before you begin

Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document, and create a dataset and table to store your data.

Required permissions

To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.

Permissions to load data into BigQuery

To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:

  • bigquery.tables.create
  • bigquery.tables.updateData
  • bigquery.tables.update
  • bigquery.jobs.create

Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.admin (includes the bigquery.jobs.create permission)
  • roles/bigquery.user (includes the bigquery.jobs.create permission)
  • roles/bigquery.jobUser (includes the bigquery.jobs.create permission)

Additionally, if you have the bigquery.datasets.create permission, you can create and update tables using a load job in the datasets that you create.

For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.

Permissions to load data from Cloud Storage

To get the permissions that you need to load data from a Cloud Storage bucket, ask your administrator to grant you the Storage Admin (roles/storage.admin) IAM role on the bucket. For more information about granting roles, see Manage access to projects, folders, and organizations.

This predefined role contains the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to load data from a Cloud Storage bucket:

  • storage.buckets.get
  • storage.objects.get
  • storage.objects.list (required if you are using a URI wildcard)

You might also be able to get these permissions with custom roles or other predefined roles.

Create a dataset and table

To store your data, you must create a BigQuery dataset, and then create a BigQuery table within that dataset.
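
For example, here is a minimal sketch using the BigQuery Python client library; the project ID, dataset ID, and location are placeholders. For Avro loads the destination table can also be created by the load job itself, so creating an empty table up front is optional.

from google.cloud import bigquery

# Create a dataset to hold the destination table. Placeholder IDs throughout.
client = bigquery.Client()

dataset = bigquery.Dataset("your-project.mydataset")
dataset.location = "US"  # Must match the location of your Cloud Storage bucket.
dataset = client.create_dataset(dataset, exists_ok=True)

print("Created dataset {}.{}".format(client.project, dataset.dataset_id))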

Advantages of Avro

Avro is the preferred format for loading data into BigQuery. Loading Avro files has the following advantages over CSV and JSON (newline delimited):

  • The Avro binary format:
    • Is faster to load. The data can be read in parallel, even if the data blocks are compressed.
    • Doesn't require typing or serialization.
    • Is easier to parse because there are no encoding issues found in other formats such as ASCII.
  • When you load Avro files into BigQuery, the table schema is automatically retrieved from the self-describing source data.

Avro schemas

When you load Avro files into a new BigQuery table, the table schema is automatically retrieved using the source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.

For example, suppose you have the following Avro files in Cloud Storage:

gs://mybucket/00/
  a.avro
  z.avro
gs://mybucket/01/
  b.avro

Running this command in the bq command-line tool loads all of the files (as a comma-separated list), and the schema is derived from mybucket/01/b.avro:

bq load \
--source_format=AVRO \
dataset.table \
"gs://mybucket/00/*.avro","gs://mybucket/01/*.avro"

When importing multiple Avro files with different Avro schemas, all schemas must be compatible with Avro's schema resolution.

When BigQuery detects the schema, some Avro data types are converted to BigQuery data types to make them compatible with GoogleSQL syntax. For more information, see Avro conversions.

To provide a table schema for creating external tables, set the referenceFileSchemaUri property in the BigQuery API or the --reference_file_schema_uri parameter in the bq command-line tool to the URL of the reference file.

For example, --reference_file_schema_uri="gs://mybucket/schema.avro".

You can also import a schema into BigQuery by specifying a JSON schema file.
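
For example, with the Python client library you can read a JSON schema file and attach it to a load job configuration. This is a hedged sketch; schema.json, the URI, and the table ID are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Read a JSON schema file and apply it to an Avro load job.
job_config = bigquery.LoadJobConfig(
    schema=client.schema_from_json("schema.json"),
    source_format=bigquery.SourceFormat.AVRO,
)

load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.avro",
    "your-project.mydataset.mytable",
    job_config=job_config,
)
load_job.result()  # Waits for the job to complete.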

Avro compression

BigQuery supports the following compression codecs for Avro file contents:

  • Snappy
  • DEFLATE
  • ZSTD

Loading Avro data into a new table

To load Avro data from Cloud Storage into a new BigQuery table, select one of the following options:

Console

  1. In the Google Cloud console, open the BigQuery page.

    Go to BigQuery

  2. In the left pane, click Explorer.


    If you don't see the left pane, click Expand left pane to open the pane.

  3. In the Explorer pane, expand the project, click Datasets, and then click the name of your dataset.

  4. In the details pane, click Create table.

  5. On the Create table page, in the Source section:

    • For Create table from, select Google Cloud Storage.

    • In the source field, browse to or enter the Cloud Storage URI. Note that you cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.


    • For File format, select Avro.

  6. On the Create table page, in the Destination section:

    • For Dataset name, choose the appropriate dataset.
    • Verify that Table type is set to Native table.
    • In the Table name field, enter the name of the table you're creating in BigQuery.
  7. In the Schema section, no action is necessary. The schema is self-described in Avro files.

  8. (Optional) To partition the table, choose your options in the Partition and cluster settings. For more information, see Creating partitioned tables.

  9. (Optional) For Partitioning filter, click the Require partition filter box to require users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Require a partition filter in queries. This option is unavailable if No partitioning is selected.

  10. (Optional) To cluster the table, in the Clustering order box, enter between one and four field names.

  11. (Optional) Click Advanced options.

    • For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
    • For Unknown values, leave Ignore unknown values cleared. This option applies only to CSV and JSON files.
    • For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
  12. Click Create table.

Note: When you load data into an empty table by using the Google Cloud console, you cannot add a label, description, table expiration, or partition expiration.

After the table is created, you can update the table's expiration, description, and labels, but you cannot add a partition expiration after a table is created using the Google Cloud console. For more information, see Managing tables.

SQL

Use the LOAD DATA DDL statement. The following example loads an Avro file into the new table mytable:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    LOAD DATA OVERWRITE mydataset.mytable
    FROM FILES (
      format = 'avro',
      uris = ['gs://bucket/path/file.avro']);

  3. Click Run.

For more information about how to run queries, see Run an interactive query.

bq

Use the bq load command, specify AVRO using the --source_format flag, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.

(Optional) Supply the --location flag and set the value to your location.

Other optional flags include:

  • --time_partitioning_type: Enables time-based partitioning on a table and sets the partition type. Possible values are HOUR, DAY, MONTH, and YEAR. This flag is optional when you create a table partitioned on a DATE, DATETIME, or TIMESTAMP column. The default partition type for time-based partitioning is DAY. You cannot change the partitioning specification on an existing table.
  • --time_partitioning_expiration: An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.
  • --time_partitioning_field: The DATE or TIMESTAMP column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.
  • --require_partition_filter: When enabled, this option requires users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Require a partition filter in queries.
  • --clustering_fields: A comma-separated list of up to four column names used to create a clustered table.
  • --destination_kms_key: The Cloud KMS key for encryption of the table data.

    For more information on partitioned tables, see:

    For more information on clustered tables, see:

    For more information on table encryption, see:

To load Avro data into BigQuery, enter the following command:

bq --location=location load \
--source_format=format \
dataset.table \
path_to_source

Replace the following:

  • location is your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • format is AVRO.
  • dataset is an existing dataset.
  • table is the name of the table into which you're loading data.
  • path_to_source is a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

The following command loads data from gs://mybucket/mydata.avro into a table named mytable in mydataset.

bq load \
--source_format=AVRO \
mydataset.mytable \
gs://mybucket/mydata.avro

The following command loads data from gs://mybucket/mydata.avro into an ingestion-time partitioned table named mytable in mydataset.

bq load \
--source_format=AVRO \
--time_partitioning_type=DAY \
mydataset.mytable \
gs://mybucket/mydata.avro

The following command loads data from gs://mybucket/mydata.avro into a new partitioned table named mytable in mydataset. The table is partitioned on the mytimestamp column.

bq load \
--source_format=AVRO \
--time_partitioning_field mytimestamp \
mydataset.mytable \
gs://mybucket/mydata.avro

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard.

bq load \
--source_format=AVRO \
mydataset.mytable \
gs://mybucket/mydata*.avro

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards.

bq load \
--source_format=AVRO \
mydataset.mytable \
"gs://mybucket/00/*.avro","gs://mybucket/01/*.avro"

API

  1. Create a load job that points to the source data in Cloud Storage.

  2. (Optional) Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs property must be fully qualified, in the format gs://bucket/object. Each URI can contain one '*' wildcard character.

  4. Specify the Avro data format by setting the sourceFormat property to AVRO.

  5. To check the job status, call jobs.get(job_id), where job_id is the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is loaded.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
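
As an illustration of this pattern, the following Python sketch generates its own job ID, starts the load (the client calls jobs.insert under the hood), and then polls the known job ID in the way step 5 describes. The table ID and URI are placeholders.

import uuid

from google.cloud import bigquery

client = bigquery.Client()

# Generate a unique job ID so a retry of the same request is idempotent.
job_id = "load_avro_{}".format(uuid.uuid4().hex)

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.avro",
    "your-project.mydataset.mytable",
    job_id=job_id,
    job_config=job_config,
)

# Poll the known job ID (the equivalent of calling jobs.get) and inspect its state.
job = client.get_job(job_id)
print(job.state)         # For example, "RUNNING" or "DONE".
print(job.error_result)  # None unless the job failed.

load_job.result()  # Waits for completion and raises if the job failed.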

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import (
    "context"
    "fmt"

    "cloud.google.com/go/bigquery"
)

// importAvro demonstrates loading Apache Avro data from Cloud Storage into a table.
func importAvro(projectID, datasetID, tableID string) error {
    // projectID := "my-project-id"
    // datasetID := "mydataset"
    // tableID := "mytable"
    ctx := context.Background()
    client, err := bigquery.NewClient(ctx, projectID)
    if err != nil {
        return fmt.Errorf("bigquery.NewClient: %v", err)
    }
    defer client.Close()

    gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.avro")
    gcsRef.SourceFormat = bigquery.Avro
    loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)

    job, err := loader.Run(ctx)
    if err != nil {
        return err
    }
    status, err := job.Wait(ctx)
    if err != nil {
        return err
    }
    if status.Err() != nil {
        return fmt.Errorf("job completed with error: %v", status.Err())
    }
    return nil
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

// Sample to load Avro data from Cloud Storage into a new BigQuery table
public class LoadAvroFromGCS {

  public static void runLoadAvroFromGCS() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.avro";
    loadAvroFromGCS(datasetName, tableName, sourceUri);
  }

  public static void loadAvroFromGCS(String datasetName, String tableName, String sourceUri) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(datasetName, tableName);
      LoadJobConfiguration loadConfig =
          LoadJobConfiguration.of(tableId, sourceUri, FormatOptions.avro());

      // Load data from a GCS Avro file into the table
      Job job = bigquery.create(JobInfo.of(loadConfig));
      // Blocks until this load table job completes its execution, either failing or succeeding.
      job = job.waitFor();
      if (job.isDone()) {
        System.out.println("Avro from GCS successfully loaded in a table");
      } else {
        System.out.println(
            "BigQuery was unable to load into the table due to an error:"
                + job.getStatus().getError());
      }
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Column not added during load append \n" + e.toString());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the Avro file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.avro
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.avro';

async function loadTableGCSAvro() {
  // Imports a GCS file into a table with Avro source format.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'us_states';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const jobConfigurationLoad = {
    load: {sourceFormat: 'AVRO'},
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), jobConfigurationLoad);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.avro"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))

Extract JSON data from Avro data

There are two ways to ensure that Avro data is loaded into BigQuery as JSON data:

  1. Annotate your Avro schema with sqlType set to JSON. For example, if you load data with the following Avro schema, then the json_field column is read as a JSON type:

    {
        "type": {"type": "string", "sqlType": "JSON"},
        "name": "json_field"
    }
  2. Specify the BigQuery destination table schema explicitly and set the column type to JSON. For more information, see Specifying a schema.

If you do not specify JSON as the type in either the Avro schema or the BigQuery table schema, then the data will be read as a STRING.
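
For the second option, a minimal Python sketch looks like the following. The json_field column name, table ID, and URI are placeholders, and the schema below lists only the JSON column for brevity; in practice you would declare the full destination schema.

from google.cloud import bigquery

client = bigquery.Client()

# Declare the destination column as JSON explicitly.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    schema=[bigquery.SchemaField("json_field", "JSON")],
)

load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.avro",
    "your-project.mydataset.mytable",
    job_config=job_config,
)
load_job.result()  # Waits for the job to complete.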

Appending to or overwriting a table with Avro data

You can load additional data into a table either from source files or by appending query results.

In the Google Cloud console, use the Write preference option to specify what action to take when you load data from a source file or from a query result.

You have the following options when you load additional data into a table:

Console option | bq tool flag | BigQuery API property | Description
Write if empty | Not supported | WRITE_EMPTY | Writes the data only if the table is empty.
Append to table | --noreplace or --replace=false; if --[no]replace is unspecified, the default is append | WRITE_APPEND | (Default) Appends the data to the end of the table.
Overwrite table | --replace or --replace=true | WRITE_TRUNCATE | Erases all existing data in a table before writing the new data. This action also deletes the table schema and row-level security, and removes any Cloud KMS key.

If you load data into an existing table, the load job can append the data or overwrite the table.

Note: This page does not cover appending or overwriting partitioned tables. For information on appending and overwriting partitioned tables, see: Appending to and overwriting partitioned table data.

To append or overwrite a table with Avro data:

Console

  1. In the Google Cloud console, open the BigQuery page.

    Go to BigQuery

  2. In the left pane, click Explorer.


  3. In the Explorer pane, expand the project, click Datasets, and then click the name of your dataset.

  4. In the details pane, click Create table.

  5. On the Create table page, in the Source section:

    • For Create table from, select Cloud Storage.
    • In the source field, browse to or enter the Cloud Storage URI. Note that you cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're appending or overwriting.


    • For File format, select Avro.

  6. On the Create table page, in the Destination section:

    • For Dataset name, choose the appropriate dataset.


    • In the Table name field, enter the name of the table you're appending or overwriting in BigQuery.

    • Verify that Table type is set to Native table.

  7. In the Schema section, no action is necessary. The schema is self-described in Avro files.

    Note: It is possible to modify the table's schema when you append or overwrite it. For more information on supported schema changes during a load operation, see Modifying table schemas.
  8. For Partition and cluster settings, leave the default values. You cannot convert a table to a partitioned or clustered table by appending or overwriting it, and the Google Cloud console does not support appending to or overwriting partitioned or clustered tables in a load job.

  9. Click Advanced options.

    • For Write preference, choose Append to table or Overwrite table.
    • For Unknown values, leave Ignore unknown values cleared. This option applies only to CSV and JSON files.
    • For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-owned and managed key setting, BigQuery encrypts the data at rest.
  10. Click Create table.

SQL

Use the LOAD DATA DDL statement. The following example appends an Avro file to the table mytable:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    LOAD DATA INTO mydataset.mytable
    FROM FILES (
      format = 'avro',
      uris = ['gs://bucket/path/file.avro']);

  3. Click Run.

For more information about how to run queries, see Run an interactive query.

bq

Enter the bq load command with the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data. Supply the --source_format flag and set it to AVRO. Because Avro schemas are automatically retrieved from the self-describing source data, you do not need to provide a schema definition.

Note: It is possible to modify the table's schema when you append or overwrite it. For more information on supported schema changes during a load operation, see Modifying table schemas.

(Optional) Supply the --location flag and set the value to your location.

Other optional flags include:

  • --destination_kms_key: The Cloud KMS key for encryption of the table data.

bq --location=location load \
--[no]replace \
--source_format=format \
dataset.table \
path_to_source

Replace the following:

  • location is your location. The --location flag is optional. You can set a default value for the location by using the .bigqueryrc file.
  • format is AVRO.
  • dataset is an existing dataset.
  • table is the name of the table into which you're loading data.
  • path_to_source is a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

The following command loads data from gs://mybucket/mydata.avro and overwrites a table named mytable in mydataset.

bq load \
--replace \
--source_format=AVRO \
mydataset.mytable \
gs://mybucket/mydata.avro

The following command loads data from gs://mybucket/mydata.avro and appends data to a table named mytable in mydataset.

bq load \
--noreplace \
--source_format=AVRO \
mydataset.mytable \
gs://mybucket/mydata.avro

For information on appending and overwriting partitioned tables using the bq command-line tool, see: Appending to and overwriting partitioned table data.

API

  1. Create a load job that points to the source data in Cloud Storage.

  2. (Optional) Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs property must be fully qualified, in the format gs://bucket/object. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported.

  4. Specify the data format by setting the configuration.load.sourceFormat property to AVRO.

  5. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE or WRITE_APPEND.

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import (
    "context"
    "fmt"

    "cloud.google.com/go/bigquery"
)

// importAvroTruncate demonstrates loading Apache Avro data from Cloud Storage into a table
// and overwriting/truncating existing data in the table.
func importAvroTruncate(projectID, datasetID, tableID string) error {
    // projectID := "my-project-id"
    // datasetID := "mydataset"
    // tableID := "mytable"
    ctx := context.Background()
    client, err := bigquery.NewClient(ctx, projectID)
    if err != nil {
        return fmt.Errorf("bigquery.NewClient: %v", err)
    }
    defer client.Close()

    gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.avro")
    gcsRef.SourceFormat = bigquery.Avro
    loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
    // Default for import jobs is to append data to a table. WriteTruncate
    // specifies that existing data should instead be replaced/overwritten.
    loader.WriteDisposition = bigquery.WriteTruncate

    job, err := loader.Run(ctx)
    if err != nil {
        return err
    }
    status, err := job.Wait(ctx)
    if err != nil {
        return err
    }
    if status.Err() != nil {
        return fmt.Errorf("job completed with error: %v", status.Err())
    }
    return nil
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

// Sample to overwrite the BigQuery table data by loading a AVRO file from GCS
public class LoadAvroFromGCSTruncate {

  public static void runLoadAvroFromGCSTruncate() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.avro";
    loadAvroFromGCSTruncate(datasetName, tableName, sourceUri);
  }

  public static void loadAvroFromGCSTruncate(String datasetName, String tableName, String sourceUri) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(datasetName, tableName);
      LoadJobConfiguration loadConfig =
          LoadJobConfiguration.newBuilder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.avro())
              // Set the write disposition to overwrite existing table data
              .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
              .build();

      // Load data from a GCS Avro file into the table
      Job job = bigquery.create(JobInfo.of(loadConfig));
      // Blocks until this load table job completes its execution, either failing or succeeding.
      job = job.waitFor();
      if (job.isDone()) {
        System.out.println("Table is successfully overwritten by AVRO file loaded from GCS");
      } else {
        System.out.println(
            "BigQuery was unable to load into the table due to an error:"
                + job.getStatus().getError());
      }
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Column not added during load append \n" + e.toString());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the Avro file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.avro
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.avro';

async function loadTableGCSAvroTruncate() {
  /**
   * Imports a GCS file into a table and overwrites
   * table data if table already exists.
   */

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'us_states';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const jobConfigurationLoad = {
    load: {
      sourceFormat: 'AVRO',
      writeDisposition: 'WRITE_TRUNCATE',
    },
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), jobConfigurationLoad);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import io

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
)

body = io.BytesIO(b"Washington,WA")
client.load_table_from_file(body, table_id, job_config=job_config).result()
previous_rows = client.get_table(table_id).num_rows
assert previous_rows > 0

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.AVRO,
)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.avro"
load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))

Loading hive-partitioned Avro data

BigQuery supports loading hive partitioned Avro data stored on Cloud Storage and populates the hive partitioning columns as columns in the destination BigQuery managed table. For more information, see Loading Externally Partitioned Data from Cloud Storage.
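
A minimal Python sketch follows, assuming Avro files laid out under hive-style paths such as gs://mybucket/tbl/dt=2024-01-01/*.avro; the bucket, prefix, and table ID are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Describe the hive-style layout of the source files.
hive_config = bigquery.HivePartitioningOptions()
hive_config.mode = "AUTO"  # Infer partition key types from the path layout.
hive_config.source_uri_prefix = "gs://mybucket/tbl"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    hive_partitioning=hive_config,
)

load_job = client.load_table_from_uri(
    "gs://mybucket/tbl/*.avro",
    "your-project.mydataset.mytable",
    job_config=job_config,
)
load_job.result()  # Waits for the job to complete.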

Avro conversions

BigQuery converts Avro data types to the following BigQuery data types:

Primitive types

Avro data type (without logicalType attribute) | BigQuery data type | Notes
null |  | BigQuery ignores these values
boolean | BOOLEAN |
int | INTEGER |
long | INTEGER |
float | FLOAT |
double | FLOAT |
bytes | BYTES |
string | STRING | UTF-8 only

Logical types

By default, BigQuery ignores the logicalType attribute for most of the types and uses the underlying Avro type instead. To convert Avro logical types to their corresponding BigQuery data types, set the --use_avro_logical_types flag to true using the bq command-line tool, or set the useAvroLogicalTypes property in the job resource when you call the jobs.insert method to create a load job.
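
For example, a minimal Python sketch looks like the following; the use_avro_logical_types property on LoadJobConfig corresponds to the useAvroLogicalTypes job property, and the table ID and URI are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Ask BigQuery to honor Avro logical types instead of the underlying Avro types.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    use_avro_logical_types=True,
)

load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.avro",
    "your-project.mydataset.mytable",
    job_config=job_config,
)
load_job.result()  # Waits for the job to complete.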

The table below shows the conversion of Avro logical types to BigQuery data types.

Avro logical type | BigQuery data type: Logical type disabled | BigQuery data type: Logical type enabled
date | INTEGER | DATE
time-millis | INTEGER | TIME
time-micros | INTEGER (converted from LONG) | TIME
timestamp-millis | INTEGER (converted from LONG) | TIMESTAMP
timestamp-micros | INTEGER (converted from LONG) | TIMESTAMP
local-timestamp-millis | INTEGER (converted from LONG) | DATETIME
local-timestamp-micros | INTEGER (converted from LONG) | DATETIME
duration | BYTES (converted from fixed type of size 12) | BYTES (converted from fixed type of size 12)
decimal | NUMERIC, BIGNUMERIC, or STRING (see Decimal logical type) | NUMERIC, BIGNUMERIC, or STRING (see Decimal logical type)

For more information on Avro data types, see the Apache Avro™ 1.8.2 Specification.

Note: When exporting to Avro from BigQuery, DATETIME is exported as a STRING with a custom logical type that is not recognized as a DATETIME upon importing back into BigQuery.

Date logical type

In any Avro file you intend to load, you must specify date logical types in the following format:

{
    "type": {"logicalType": "date", "type": "int"},
    "name": "date_field"
}

Decimal logical type

Decimal logical types can be converted to NUMERIC, BIGNUMERIC, or STRING types. The converted type depends on the precision and scale parameters of the decimal logical type and the specified decimal target types. Specify the decimal target type as follows:
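
As one illustration, the following Python sketch sets the decimal target types on the load job configuration; the decimal_target_types property, table ID, and URI here are assumptions for the sketch rather than values taken from this page.

from google.cloud import bigquery

client = bigquery.Client()

# Allow NUMERIC, BIGNUMERIC, or STRING as the column type, depending on the
# decimal's precision and scale.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    decimal_target_types=["NUMERIC", "BIGNUMERIC", "STRING"],
)

load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.avro",
    "your-project.mydataset.mytable",
    job_config=job_config,
)
load_job.result()  # Waits for the job to complete.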

For backward compatibility, if the decimal target types are not specified, you can load an Avro file containing a bytes column with the decimal logical type into a BYTES column of an existing table. In this case, the decimal logical type on the column in the Avro file is ignored. This conversion mode is deprecated and might be removed in the future.

For more information on the Avro decimal logical type, see the Apache Avro™ 1.8.2 Specification.

Time logical type

In any Avro file you intend to load, you must specify time logical types in one of the following formats.

For millisecond precision:

{
    "type": {"logicalType": "time-millis", "type": "int"},
    "name": "time_millis_field"
}

For microsecond precision:

{
    "type": {"logicalType": "time-micros", "type": "long"},
    "name": "time_micros_field"
}

Timestamp logical type

In any Avro file you intend to load, you must specify timestamp logical types in one of the following formats.

For millisecond precision:

{
    "type": {"logicalType": "timestamp-millis", "type": "long"},
    "name": "timestamp_millis_field"
}

For microsecond precision:

{
    "type": {"logicalType": "timestamp-micros", "type": "long"},
    "name": "timestamp_micros_field"
}

Local-Timestamp logical type

In any Avro file you intend to load, you must specify a local-timestamp logical type in one of the following formats.

For millisecond precision:

{
    "type": {"logicalType": "local-timestamp-millis", "type": "long"},
    "name": "local_timestamp_millis_field"
}

For microsecond precision:

{
    "type": {"logicalType": "local-timestamp-micros", "type": "long"},
    "name": "local_timestamp_micros_field"
}

Complex types

Avro data type | BigQuery data type | Notes
record | RECORD |
  • Aliases are ignored
  • Doc is converted into a field description
  • Default values are set at read time
  • Order is ignored
  • Recursive fields are dropped. Only the first level of nesting is maintained for recursive fields.
enum | STRING |
  • The string is the symbolic value of the enum
  • Aliases are ignored
  • Doc is converted into a field description
array | repeated fields | Arrays of arrays are not supported. Arrays containing only NULL types are ignored.
map<T> | RECORD | BigQuery converts an Avro map<T> field to a repeated RECORD that contains two fields: a key and a value. BigQuery stores the key as a STRING, and converts the value to its corresponding data type in BigQuery.
union | Nullable field, or RECORD with a list of nullable fields |
  • When a union has only one non-null type, it converts to a nullable field.
  • Otherwise, it converts to a RECORD with a list of nullable fields. Only one of these fields will be set at read time.
fixed | BYTES |
  • Aliases are ignored
  • Size is ignored
