Loading ORC data from Cloud Storage
This page provides an overview of loading ORC data from Cloud Storage into BigQuery.
ORC is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem.
When you load ORC data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
For information about loading ORC data from a local file, see Loading data into BigQuery from a local data source.
Limitations
You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:
- BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
- BigQuery doesn't support Cloud Storage object versioning. If you include a generation number in the Cloud Storage URI, then the load job fails.
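The generation-number limitation can be guarded against in client code before a load job is submitted. The following sketch is illustrative only; the helper name is hypothetical and not part of any Google client library. It strips a "#&lt;generation&gt;" suffix from a Cloud Storage URI:

```python
# Hypothetical helper (not part of any Google client library): remove a
# "#<generation>" suffix from a Cloud Storage URI so that the load job
# does not fail on a generation-qualified object name.
def strip_generation(uri):
    """Return the URI without a trailing #<generation> suffix, if present."""
    base, sep, suffix = uri.partition("#")
    if sep and suffix.isdigit():
        return base
    return uri

print(strip_generation("gs://mybucket/mydata.orc#1579287380533984"))
# gs://mybucket/mydata.orc
```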
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document, and create a dataset to store your data.
Required permissions
To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.
Permissions to load data into BigQuery
To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:
- bigquery.tables.create
- bigquery.tables.updateData
- bigquery.tables.update
- bigquery.jobs.create
Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:
- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.admin (includes the bigquery.jobs.create permission)
- roles/bigquery.user (includes the bigquery.jobs.create permission)
- roles/bigquery.jobUser (includes the bigquery.jobs.create permission)
Additionally, if you have the bigquery.datasets.create permission, you can create and update tables using a load job in the datasets that you create.
For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Permissions to load data from Cloud Storage
To get the permissions that you need to load data from a Cloud Storage bucket, ask your administrator to grant you the Storage Admin (roles/storage.admin) IAM role on the bucket. For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to load data from a Cloud Storage bucket:
- storage.buckets.get
- storage.objects.get
- storage.objects.list (required if you are using a URI wildcard)
You might also be able to get these permissions with custom roles or other predefined roles.
Create a dataset
Create a BigQuery dataset to store your data.
ORC schemas
When you load ORC files into BigQuery, the table schema is automatically retrieved from the self-describing source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.
For example, you have the following ORC files in Cloud Storage:

```
gs://mybucket/00/
  a.orc
  z.orc
gs://mybucket/01/
  b.orc
```

Running this command in the bq command-line tool loads all of the files (as a comma-separated list), and the schema is derived from mybucket/01/b.orc:

```
bq load \
--source_format=ORC \
dataset.table \
"gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"
```
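The "alphabetically last file" rule can be sketched in plain Python. This is only an illustration of the ordering, not a BigQuery API call; the function name is hypothetical, and the file names are the ones from the example above:

```python
# Illustration of the schema-selection rule: among the objects matched by the
# load job's URIs, BigQuery reads the schema from the alphabetically last one.
def schema_source(object_names):
    """Return the object name whose schema would be used for autodetection."""
    return max(object_names)

matched = ["mybucket/00/a.orc", "mybucket/00/z.orc", "mybucket/01/b.orc"]
print(schema_source(matched))  # mybucket/01/b.orc
```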
When BigQuery detects the schema, some ORC data types are converted to BigQuery data types to make them compatible with GoogleSQL syntax. All fields in the detected schema are NULLABLE. For more information, see ORC conversions.
When you load multiple ORC files that have different schemas, identical fields (with the same name and same nested level) specified in multiple schemas must map to the same converted BigQuery data type in each schema definition.
To provide a table schema for creating external tables, set the referenceFileSchemaUri property in the BigQuery API or the --reference_file_schema_uri parameter in the bq command-line tool to the URL of the reference file. For example, --reference_file_schema_uri="gs://mybucket/schema.orc".
ORC compression
BigQuery supports the following compression codecs for ORC file contents:
- Zlib
- Snappy
- LZO
- LZ4
- ZSTD
Data in ORC files doesn't remain compressed after it is uploaded to BigQuery. Data storage is reported in logical bytes or physical bytes, depending on the dataset storage billing model. To get information on storage usage, query the INFORMATION_SCHEMA.TABLE_STORAGE view.
Loading ORC data into a new table
You can load ORC data into a new table by:
- Using the Google Cloud console
- Using the bq command-line tool's bq load command
- Calling the jobs.insert API method and configuring a load job
- Using the client libraries
To load ORC data from Cloud Storage into a new BigQuery table:
Console
In the Google Cloud console, go to the BigQuery page.
- In the left pane, click Explorer.
- In the Explorer pane, expand your project, click Datasets, and then select a dataset.
- In the Dataset info section, click Create table.
- In the Create table pane, specify the following details:
- In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
  - Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
  - For File format, select ORC.
- In the Destination section, specify the following details:
  - For Dataset, select the dataset in which you want to create the table.
  - In the Table field, enter the name of the table that you want to create.
  - Verify that the Table type field is set to Native table.
- In the Schema section, no action is necessary. The schema is self-described in ORC files.
- Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables.
- Click Advanced options and do the following:
  - For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
  - If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
  - For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
- Click Create table.
After the table is created, you can update the table's expiration, description, and labels, but you cannot add a partition expiration after a table is created using the Google Cloud console. For more information, see Managing tables.
SQL
Use the LOAD DATA DDL statement. The following example loads an ORC file into the new table mytable:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
```sql
LOAD DATA OVERWRITE mydataset.mytable
FROM FILES (
  format = 'ORC',
  uris = ['gs://bucket/path/file.orc']);
```
Click Run.
For more information about how to run queries, see Run an interactive query.
bq
Use the bq load command, specify ORC as the source_format, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
- --time_partitioning_type: Enables time-based partitioning on a table and sets the partition type. Possible values are HOUR, DAY, MONTH, and YEAR. This flag is optional when you create a table partitioned on a DATE, DATETIME, or TIMESTAMP column. The default partition type for time-based partitioning is DAY. You cannot change the partitioning specification on an existing table.
- --time_partitioning_expiration: An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.
- --time_partitioning_field: The DATE or TIMESTAMP column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.
- --require_partition_filter: When enabled, this option requires users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Require a partition filter in queries.
- --clustering_fields: A comma-separated list of up to four column names used to create a clustered table.
- --destination_kms_key: The Cloud KMS key for encryption of the table data.

For more information about partitioned tables, see:
For more information about clustered tables, see:
For more information about table encryption, see:
To load ORC data into BigQuery, enter the following command:
```
bq --location=location load \
--source_format=format \
dataset.table \
path_to_source
```
Where:
- location is your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
- format is ORC.
- dataset is an existing dataset.
- table is the name of the table into which you're loading data.
- path_to_source is a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from gs://mybucket/mydata.orc into a table named mytable in mydataset.

```
bq load \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc
```

The following command loads data from gs://mybucket/mydata.orc into a new ingestion-time partitioned table named mytable in mydataset.

```
bq load \
--source_format=ORC \
--time_partitioning_type=DAY \
mydataset.mytable \
gs://mybucket/mydata.orc
```

The following command loads data from gs://mybucket/mydata.orc into a partitioned table named mytable in mydataset. The table is partitioned on the mytimestamp column.

```
bq load \
--source_format=ORC \
--time_partitioning_field mytimestamp \
mydataset.mytable \
gs://mybucket/mydata.orc
```

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard.

```
bq load \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata*.orc
```

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards.

```
bq load --autodetect \
--source_format=ORC \
mydataset.mytable \
"gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"
```

API
Create a load job that points to the source data in Cloud Storage.
(Optional) Specify your location in the location property in the jobReference section of the job resource.
The source URIs property must be fully qualified, in the format gs://bucket/object. Each URI can contain one '*' wildcard character.
Specify the ORC data format by setting the sourceFormat property to ORC.
To check the job status, call jobs.get(job_id), where job_id is the ID of the job returned by the initial request.
- If status.state = DONE, the job completed successfully.
- If the status.errorResult property is present, the request failed, and that object includes information describing what went wrong. When a request fails, no table is created and no data is loaded.
- If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.
API notes:
Load jobs are atomic and consistent; if a load job fails, none of the datais available, and if a load job succeeds, all of the data is available.
- As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.
- Calling jobs.insert on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations succeeds.
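The unique-job-ID practice can be sketched as follows. The helper name is hypothetical, and the commented-out load call assumes the google-cloud-bigquery Python client (which accepts a job_id argument on load_table_from_uri):

```python
import uuid

# Hypothetical helper: build a unique, retry-safe job ID to pass as
# jobReference.jobId when calling jobs.insert.
def make_job_id(prefix="orc_load"):
    return "{}_{}".format(prefix, uuid.uuid4().hex)

job_id = make_job_id()
# Retrying jobs.insert with this same job_id is idempotent, e.g.:
# client.load_table_from_uri(uri, table_id, job_id=job_id, job_config=job_config)
print(job_id)
```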
C#
Before trying this sample, follow the C# setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery C# API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```csharp
using Google.Apis.Bigquery.v2.Data;
using Google.Cloud.BigQuery.V2;
using System;

public class BigQueryLoadTableGcsOrc
{
    public void LoadTableGcsOrc(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id")
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
        var dataset = client.GetDataset(datasetId);
        TableReference destinationTableRef = dataset.GetTableReference(tableId: "us_states");
        // Create job configuration
        var jobOptions = new CreateLoadJobOptions()
        {
            SourceFormat = FileFormat.Orc
        };
        // Create and run job
        var loadJob = client.CreateLoadJob(
            sourceUri: gcsURI,
            destination: destinationTableRef,
            // Pass null as the schema because the schema is inferred when
            // loading Orc data
            schema: null,
            options: jobOptions);
        loadJob = loadJob.PollUntilCompleted().ThrowOnAnyError();  // Waits for the job to complete.
        // Display the number of rows uploaded
        BigQueryTable table = client.GetTable(destinationTableRef);
        Console.WriteLine($"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
    }
}
```
Go
Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```go
import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importORC demonstrates loading Apache ORC data from Cloud Storage into a table.
func importORC(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.orc")
	gcsRef.SourceFormat = bigquery.ORC
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
```
Java
```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;

// Sample to load ORC data from Cloud Storage into a new BigQuery table
public class LoadOrcFromGCS {

  public static void runLoadOrcFromGCS() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
    Schema schema =
        Schema.of(
            Field.of("name", StandardSQLTypeName.STRING),
            Field.of("post_abbr", StandardSQLTypeName.STRING));
    loadOrcFromGCS(datasetName, tableName, sourceUri, schema);
  }

  public static void loadOrcFromGCS(
      String datasetName, String tableName, String sourceUri, Schema schema) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
      TableId tableId = TableId.of(datasetName, tableName);
      LoadJobConfiguration loadConfig =
          LoadJobConfiguration.newBuilder(tableId, sourceUri, FormatOptions.orc())
              .setSchema(schema)
              .build();
      // Load data from a GCS ORC file into the table
      Job job = bigquery.create(JobInfo.of(loadConfig));
      // Blocks until this load table job completes its execution, either failing or succeeding.
      job = job.waitFor();
      if (job.isDone() && job.getStatus().getError() == null) {
        System.out.println("ORC from GCS successfully added during load append job");
      } else {
        System.out.println(
            "BigQuery was unable to load into the table due to an error:"
                + job.getStatus().getError());
      }
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Column not added during load append \n" + e.toString());
    }
  }
}
```
Node.js
Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```javascript
// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the ORC file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.orc
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.orc';

async function loadTableGCSORC() {
  // Imports a GCS file into a table with ORC source format.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'ORC',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);
  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
```
PHP
Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```php
use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.orc';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('ORC');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
```
Python
Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```python
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.ORC)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
```
Ruby
Before trying this sample, follow the Ruby setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Ruby API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```ruby
require "google/cloud/bigquery"

def load_table_gcs_orc dataset_id = "your_dataset_id"
  bigquery = Google::Cloud::Bigquery.new
  dataset = bigquery.dataset dataset_id
  gcs_uri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"
  table_id = "us_states"
  load_job = dataset.load_job table_id, gcs_uri, format: "orc"
  puts "Starting job #{load_job.job_id}"

  load_job.wait_until_done! # Waits for table load to complete.
  puts "Job finished."

  table = dataset.table table_id
  puts "Loaded #{table.rows_count} rows to table #{table.id}"
end
```
Append to or overwrite a table with ORC data
You can load additional data into a table either from source files or byappending query results.
In the Google Cloud console, use theWrite preference option to specifywhat action to take when you load data from a source file or from a queryresult.
You have the following options when you load additional data into a table:
| Console option | bq tool flag | BigQuery API property | Description |
|---|---|---|---|
| Write if empty | Not supported | WRITE_EMPTY | Writes the data only if the table is empty. |
| Append to table | --noreplace or--replace=false; if--[no]replace is unspecified, the default is append | WRITE_APPEND | (Default) Appends the data to the end of the table. |
| Overwrite table | --replace or--replace=true | WRITE_TRUNCATE | Erases all existing data in a table before writing the new data. This action also deletes the table schema, row level security, and removes any Cloud KMS key. |
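As a sketch of how the console options in the table above map onto the REST API, the following builds a minimal jobs.insert request body for an ORC load. The helper name is hypothetical; only the fields shown are included, and the real JobConfigurationLoad resource accepts many more:

```python
# Hypothetical helper: build a minimal jobs.insert request body for an ORC
# load, mapping the write preference onto configuration.load.writeDisposition.
def build_load_request(uri, project, dataset, table, write_disposition="WRITE_APPEND"):
    assert write_disposition in ("WRITE_EMPTY", "WRITE_APPEND", "WRITE_TRUNCATE")
    return {
        "configuration": {
            "load": {
                "sourceUris": [uri],
                "sourceFormat": "ORC",
                "writeDisposition": write_disposition,
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
            }
        }
    }

body = build_load_request(
    "gs://mybucket/mydata.orc", "myproject", "mydataset", "mytable", "WRITE_TRUNCATE"
)
print(body["configuration"]["load"]["writeDisposition"])  # WRITE_TRUNCATE
```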
If you load data into an existing table, the load job can append the data or overwrite the table.
You can append or overwrite a table by:
- Using the Google Cloud console
- Using the bq command-line tool's bq load command
- Calling the jobs.insert API method and configuring a load job
- Using the client libraries
To append or overwrite a table with ORC data:
Console
In the Google Cloud console, go to the BigQuery page.
- In the left pane, click Explorer.
- In the Explorer pane, expand your project, click Datasets, and then select a dataset.
- In the Dataset info section, click Create table.
- In the Create table pane, specify the following details:
- In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
  - Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
  - For File format, select ORC.
- In the Destination section, specify the following details:
  - For Dataset, select the dataset in which you want to create the table.
  - In the Table field, enter the name of the table that you want to create.
  - Verify that the Table type field is set to Native table.
- In the Schema section, no action is necessary. The schema is self-described in ORC files. Note: It is possible to modify the table's schema when you append or overwrite it. For more information about supported schema changes during a load operation, see Modifying table schemas.
- Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables. You cannot convert a table to a partitioned or clustered table by appending or overwriting it. The Google Cloud console does not support appending to or overwriting partitioned or clustered tables in a load job.
- Click Advanced options and do the following:
  - For Write preference, choose Append to table or Overwrite table.
  - If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
  - For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
- Click Create table.
SQL
Use the LOAD DATA DDL statement. The following example appends an ORC file to the table mytable:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
```sql
LOAD DATA INTO mydataset.mytable
FROM FILES (
  format = 'ORC',
  uris = ['gs://bucket/path/file.orc']);
```
Click Run.
For more information about how to run queries, see Run an interactive query.
bq
Enter the bq load command with the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data. Supply the --source_format flag and set it to ORC. Because ORC schemas are automatically retrieved from the self-describing source data, you don't need to provide a schema definition.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
--destination_kms_key: The Cloud KMS key for encryption of thetable data.
```
bq --location=location load \
--[no]replace \
--source_format=format \
dataset.table \
path_to_source
```
Where:
- location is your location. The --location flag is optional. You can set a default value for the location by using the .bigqueryrc file.
- format is ORC.
- dataset is an existing dataset.
- table is the name of the table into which you're loading data.
- path_to_source is a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from gs://mybucket/mydata.orc and overwrites a table named mytable in mydataset.

```
bq load \
--replace \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc
```

The following command loads data from gs://mybucket/mydata.orc and appends data to a table named mytable in mydataset.

```
bq load \
--noreplace \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc
```

For information about appending and overwriting partitioned tables using the bq command-line tool, see Appending to and overwriting partitioned table data.
API
Create a load job that points to the source data in Cloud Storage.
(Optional) Specify your location in the location property in the jobReference section of the job resource.
The source URIs property must be fully qualified, in the format gs://bucket/object. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported.
Specify the data format by setting the configuration.load.sourceFormat property to ORC.
Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE or WRITE_APPEND.
C#
Before trying this sample, follow the C# setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery C# API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```csharp
using Google.Apis.Bigquery.v2.Data;
using Google.Cloud.BigQuery.V2;
using System;

public class BigQueryLoadTableGcsOrcTruncate
{
    public void LoadTableGcsOrcTruncate(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id",
        string tableId = "your_table_id")
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
        var dataset = client.GetDataset(datasetId);
        TableReference destinationTableRef = dataset.GetTableReference(tableId: "us_states");
        // Create job configuration
        var jobOptions = new CreateLoadJobOptions()
        {
            SourceFormat = FileFormat.Orc,
            WriteDisposition = WriteDisposition.WriteTruncate
        };
        // Create and run job
        var loadJob = client.CreateLoadJob(
            sourceUri: gcsURI,
            destination: destinationTableRef,
            // Pass null as the schema because the schema is inferred when
            // loading Orc data
            schema: null,
            options: jobOptions);
        loadJob = loadJob.PollUntilCompleted().ThrowOnAnyError();  // Waits for the job to complete.
        // Display the number of rows uploaded
        BigQueryTable table = client.GetTable(destinationTableRef);
        Console.WriteLine($"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
    }
}
```
Go
Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```go
import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importORCTruncate demonstrates loading Apache ORC data from Cloud Storage into a table
// and overwriting/truncating existing data in the table.
func importORCTruncate(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.orc")
	gcsRef.SourceFormat = bigquery.ORC
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
	// Default for import jobs is to append data to a table. WriteTruncate
	// specifies that existing data should instead be replaced/overwritten.
	loader.WriteDisposition = bigquery.WriteTruncate

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
```
Java
```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

// Sample to overwrite the BigQuery table data by loading a ORC file from GCS
public class LoadOrcFromGcsTruncate {

  public static void runLoadOrcFromGcsTruncate() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
    loadOrcFromGcsTruncate(datasetName, tableName, sourceUri);
  }

  public static void loadOrcFromGcsTruncate(
      String datasetName, String tableName, String sourceUri) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
      TableId tableId = TableId.of(datasetName, tableName);
      LoadJobConfiguration loadConfig =
          LoadJobConfiguration.newBuilder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.orc())
              // Set the write disposition to overwrite existing table data
              .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
              .build();
      // Load data from a GCS ORC file into the table
      Job job = bigquery.create(JobInfo.of(loadConfig));
      // Blocks until this load table job completes its execution, either failing or succeeding.
      job = job.waitFor();
      if (job.isDone() && job.getStatus().getError() == null) {
        System.out.println("Table is successfully overwritten by ORC file loaded from GCS");
      } else {
        System.out.println(
            "BigQuery was unable to load into the table due to an error:"
                + job.getStatus().getError());
      }
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Column not added during load append \n" + e.toString());
    }
  }
}
```
Node.js
Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```javascript
// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate the clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the ORC file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.orc
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.orc';

async function loadORCFromGCSTruncate() {
  /**
   * Imports a GCS file into a table and overwrites
   * table data if table already exists.
   */

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = "my_dataset";
  // const tableId = "my_table";

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'ORC',
    // Set the write disposition to overwrite existing table data.
    writeDisposition: 'WRITE_TRUNCATE',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
```
PHP
Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```php
use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';
// $tableId = 'The BigQuery table ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$table = $bigQuery->dataset($datasetId)->table($tableId);

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.orc';
$loadConfig = $table->loadFromStorage($gcsUri)
    ->sourceFormat('ORC')
    ->writeDisposition('WRITE_TRUNCATE');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
```
Python
Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```python
import io

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
)

body = io.BytesIO(b"Washington,WA")
client.load_table_from_file(body, table_id, job_config=job_config).result()
previous_rows = client.get_table(table_id).num_rows
assert previous_rows > 0

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.ORC,
)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"
load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
```
Ruby
Before trying this sample, follow the Ruby setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Ruby API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```ruby
require "google/cloud/bigquery"

def load_table_gcs_orc_truncate dataset_id = "your_dataset_id",
                                table_id   = "your_table_id"
  bigquery = Google::Cloud::Bigquery.new
  dataset  = bigquery.dataset dataset_id
  gcs_uri  = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"

  load_job = dataset.load_job table_id,
                              gcs_uri,
                              format: "orc",
                              write:  "truncate"
  puts "Starting job #{load_job.job_id}"

  load_job.wait_until_done! # Waits for table load to complete.
  puts "Job finished."

  table = dataset.table table_id
  puts "Loaded #{table.rows_count} rows to table #{table.id}"
end
```
Load hive-partitioned ORC data
BigQuery supports loading hive-partitioned ORC data stored on Cloud Storage and populates the hive partitioning columns as columns in the destination BigQuery managed table. For more information, see Loading externally partitioned data from Cloud Storage.
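As an illustrative sketch, a load job for hive-partitioned ORC data can declare `hivePartitioningOptions` in its `JobConfigurationLoad`. The bucket, project, dataset, and table names below are hypothetical placeholders:

```json
{
  "configuration": {
    "load": {
      "sourceUris": ["gs://my-bucket/orc-data/*"],
      "sourceFormat": "ORC",
      "hivePartitioningOptions": {
        "mode": "AUTO",
        "sourceUriPrefix": "gs://my-bucket/orc-data/"
      },
      "destinationTable": {
        "projectId": "my-project",
        "datasetId": "mydataset",
        "tableId": "mytable"
      }
    }
  }
}
```

With `"mode": "AUTO"`, partition keys encoded in the object paths (for example, `dt=2024-01-01/`) are detected automatically and become columns in the destination table.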
ORC conversions
BigQuery converts ORC data types to the following BigQuery data types:
Primitive types
| ORC data type | BigQuery data type | Notes |
|---|---|---|
| boolean | BOOLEAN | |
| byte | INTEGER | |
| short | INTEGER | |
| int | INTEGER | |
| long | INTEGER | |
| float | FLOAT | |
| double | FLOAT | |
| string | STRING | UTF-8 only |
| varchar | STRING | UTF-8 only |
| char | STRING | UTF-8 only |
| binary | BYTES | |
| date | DATE | An attempt to convert any value in the ORC data that is less than -719162 days or greater than 2932896 days returns an invalid date value error. If this affects you, contact Support to have unsupported values converted to the BigQuery minimum value of 0001-01-01 or maximum value of 9999-12-31, as appropriate. |
| timestamp | TIMESTAMP | ORC supports nanosecond precision, but BigQuery converts sub-microsecond values to microseconds when the data is read. An attempt to convert any value in the ORC data that is less than -719162 days or greater than 2932896 days returns an invalid date value error. |
| decimal | NUMERIC, BIGNUMERIC, or STRING | See Decimal type. |
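The nanosecond-to-microsecond conversion for `timestamp` values can be illustrated with a small sketch. The truncation of sub-microsecond digits shown here is an assumption about the conversion, and the helper name is hypothetical, not code from the BigQuery client:

```python
def orc_nanos_to_bq_micros(epoch_nanos: int) -> int:
    """Drop sub-microsecond digits from an ORC nanosecond timestamp.

    BigQuery TIMESTAMP has microsecond precision, so the final three
    decimal digits of a nanosecond epoch value cannot be represented.
    """
    return epoch_nanos // 1_000


# 2023-11-14T22:13:20.123456789Z, as nanoseconds since the Unix epoch:
nanos = 1_700_000_000_123_456_789
print(orc_nanos_to_bq_micros(nanos))  # the trailing 789 ns are lost
```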
Decimal type
Decimal logical types can be converted to NUMERIC, BIGNUMERIC, or STRING types. The converted type depends on the precision and scale parameters of the decimal logical type and the specified decimal target types. Specify the decimal target type as follows:
- For a load job using the `jobs.insert` API: use the `JobConfigurationLoad.decimalTargetTypes` field.
- For a load job using the `bq load` command in the bq command-line tool: use the `--decimal_target_types` flag.
- For a query against a table with external sources: use the `ExternalDataConfiguration.decimalTargetTypes` field.
- For a persistent external table created with DDL: use the `decimal_target_types` option.
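The selection among the listed target types can be sketched in a few lines of Python. This is a simplified model: the precision/scale bounds below are taken from BigQuery's parameterized NUMERIC and BIGNUMERIC limits, the helper name is hypothetical, and this is not the service's actual implementation:

```python
def pick_decimal_target(precision: int, scale: int, target_types: list) -> str:
    """Return the narrowest listed type that can hold decimal(precision, scale)."""

    def fits(t: str) -> bool:
        if t == "NUMERIC":
            # NUMERIC: up to 9 digits of scale and 29 integer digits.
            return scale <= 9 and precision - scale <= 29
        if t == "BIGNUMERIC":
            # BIGNUMERIC: up to 38 digits of scale and 38 integer digits.
            return scale <= 38 and precision - scale <= 38
        return t == "STRING"  # STRING can hold any decimal as text.

    for t in ("NUMERIC", "BIGNUMERIC", "STRING"):  # narrowest type first
        if t in target_types and fits(t):
            return t
    raise ValueError("no listed decimal target type fits this precision/scale")


print(pick_decimal_target(38, 9, ["NUMERIC", "BIGNUMERIC"]))   # NUMERIC
print(pick_decimal_target(40, 10, ["NUMERIC", "BIGNUMERIC"]))  # BIGNUMERIC
```

For example, a `decimal(77, 38)` column exceeds both NUMERIC and BIGNUMERIC bounds, so it is kept only if STRING is among the listed target types.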
Complex types
| ORC data type | BigQuery data type | Notes |
|---|---|---|
| struct | RECORD | |
| map<K,V> | RECORD | An ORC map<K,V> field is converted to a repeated RECORD that contains two fields: a key of the same data type as K, and a value of the same data type as V. Both fields are NULLABLE. |
| list | repeated fields | Nested lists and lists of maps are not supported. |
| union | RECORD | |
Column names
A column name can contain letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. If you use flexible column names, BigQuery supports starting a column name with a number. Exercise caution when starting column names with a number, because using flexible column names with the BigQuery Storage Read API or BigQuery Storage Write API requires special handling. For more information about flexible column name support, see Flexible column names.
Column names have a maximum length of 300 characters. Column names can't use any of the following prefixes:

- `_TABLE_`
- `_FILE_`
- `_PARTITION`
- `_ROW_TIMESTAMP`
- `__ROOT__`
- `_COLIDENTIFIER`
- `_CHANGE_SEQUENCE_NUMBER`
- `_CHANGE_TYPE`
- `_CHANGE_TIMESTAMP`
Duplicate column names are not allowed even if the case differs. For example, a column named `Column1` is considered identical to a column named `column1`. To learn more about column naming rules, see Column names in the GoogleSQL reference.
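The rules above for classic (non-flexible) column names can be expressed as a short validation sketch. The helper name is hypothetical, and flexible names (which may start with a number) are deliberately out of scope here:

```python
import re

# Letters, digits, underscores; must start with a letter or underscore.
_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
# Reserved prefixes from the list above.
_RESERVED_PREFIXES = (
    "_TABLE_", "_FILE_", "_PARTITION", "_ROW_TIMESTAMP", "__ROOT__",
    "_COLIDENTIFIER", "_CHANGE_SEQUENCE_NUMBER", "_CHANGE_TYPE",
    "_CHANGE_TIMESTAMP",
)


def validate_column_names(names):
    seen = set()
    for name in names:
        if len(name) > 300 or not _NAME_RE.match(name):
            raise ValueError(f"invalid column name: {name!r}")
        if name.upper().startswith(_RESERVED_PREFIXES):
            raise ValueError(f"reserved prefix in column name: {name!r}")
        lowered = name.lower()  # duplicates are detected case-insensitively
        if lowered in seen:
            raise ValueError(f"duplicate column name: {name!r}")
        seen.add(lowered)
    return True


print(validate_column_names(["name", "post_abbr"]))  # True
```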
If a table name (for example, `test`) is the same as one of its column names (for example, `test`), the SELECT expression interprets the `test` column as a STRUCT containing all other table columns. To avoid this collision, use one of the following methods:
- Avoid using the same name for a table and its columns.

- Avoid using `_field_` as a column name prefix. System-reserved prefixes cause automatic renaming during queries. For example, the `SELECT _field_ FROM project1.dataset.test` query returns a column named `_field_1`. If you must query a column with this name, use an alias to control the output.

- Assign the table a different alias. For example, the following query assigns a table alias `t` to the table `project1.dataset.test`: `SELECT test FROM project1.dataset.test AS t;`

- Include the table name when referencing a column. For example: `SELECT test.test FROM project1.dataset.test;`
Flexible column names
You have more flexibility in what you name columns, including expanded access to characters in languages other than English as well as additional symbols. Make sure to use backtick (`) characters to enclose flexible column names if they are quoted identifiers.
Flexible column names support the following characters:
- Any letter in any language, as represented by the Unicode regular expression `\p{L}`.
- Any numeric character in any language, as represented by the Unicode regular expression `\p{N}`.
- Any connector punctuation character, including underscores, as represented by the Unicode regular expression `\p{Pc}`.
- A hyphen or dash, as represented by the Unicode regular expression `\p{Pd}`.
- Any mark intended to accompany another character, as represented by the Unicode regular expression `\p{M}`. For example, accents, umlauts, or enclosing boxes.
- The following special characters:
  - An ampersand (`&`), as represented by the Unicode regular expression `\u0026`.
  - A percent sign (`%`), as represented by the Unicode regular expression `\u0025`.
  - An equals sign (`=`), as represented by the Unicode regular expression `\u003D`.
  - A plus sign (`+`), as represented by the Unicode regular expression `\u002B`.
  - A colon (`:`), as represented by the Unicode regular expression `\u003A`.
  - An apostrophe (`'`), as represented by the Unicode regular expression `\u0027`.
  - A less-than sign (`<`), as represented by the Unicode regular expression `\u003C`.
  - A greater-than sign (`>`), as represented by the Unicode regular expression `\u003E`.
  - A number sign (`#`), as represented by the Unicode regular expression `\u0023`.
  - A vertical line (`|`), as represented by the Unicode regular expression `\u007C`.
  - Whitespace.
Flexible column names don't support the following special characters:
- An exclamation mark (`!`), as represented by the Unicode regular expression `\u0021`.
- A quotation mark (`"`), as represented by the Unicode regular expression `\u0022`.
- A dollar sign (`$`), as represented by the Unicode regular expression `\u0024`.
- A left parenthesis (`(`), as represented by the Unicode regular expression `\u0028`.
- A right parenthesis (`)`), as represented by the Unicode regular expression `\u0029`.
- An asterisk (`*`), as represented by the Unicode regular expression `\u002A`.
- A comma (`,`), as represented by the Unicode regular expression `\u002C`.
- A period (`.`), as represented by the Unicode regular expression `\u002E`. Periods are not replaced by underscores in Parquet file column names when a column name character map is used. For more information, see flexible column limitations.
- A slash (`/`), as represented by the Unicode regular expression `\u002F`.
- A semicolon (`;`), as represented by the Unicode regular expression `\u003B`.
- A question mark (`?`), as represented by the Unicode regular expression `\u003F`.
- An at sign (`@`), as represented by the Unicode regular expression `\u0040`.
- A left square bracket (`[`), as represented by the Unicode regular expression `\u005B`.
- A backslash (`\`), as represented by the Unicode regular expression `\u005C`.
- A right square bracket (`]`), as represented by the Unicode regular expression `\u005D`.
- A circumflex accent (`^`), as represented by the Unicode regular expression `\u005E`.
- A grave accent (`` ` ``), as represented by the Unicode regular expression `\u0060`.
- A left curly bracket (`{`), as represented by the Unicode regular expression `\u007B`.
- A right curly bracket (`}`), as represented by the Unicode regular expression `\u007D`.
- A tilde (`~`), as represented by the Unicode regular expression `\u007E`.
For additional guidelines, seeColumn names.
The expanded column characters are supported by both the BigQuery Storage Read API and the BigQuery Storage Write API. To use the expanded list of Unicode characters with the BigQuery Storage Read API, you must set a flag. You can use the `displayName` attribute to retrieve the column name. The following example shows how to set a flag with the Python client:
```python
from google.cloud.bigquery_storage import types

requested_session = types.ReadSession()

# Set Avro serialization options for flexible column names.
options = types.AvroSerializationOptions()
options.enable_display_name_attribute = True
requested_session.read_options.avro_serialization_options = options
```

To use the expanded list of Unicode characters with the BigQuery Storage Write API, you must provide the schema with `column_name` notation, unless you are using the `JsonStreamWriter` writer object. The following example shows how to provide the schema:
```proto
syntax = "proto2";

package mypackage;

// Source protos located in github.com/googleapis/googleapis
import "google/cloud/bigquery/storage/v1/annotations.proto";

message FlexibleSchema {
  optional string item_name_column = 1
      [(.google.cloud.bigquery.storage.v1.column_name) = "name-列"];
  optional string item_description_column = 2
      [(.google.cloud.bigquery.storage.v1.column_name) = "description-列"];
}
```

In this example, `item_name_column` and `item_description_column` are placeholder names that need to be compliant with the protocol buffer naming convention. Note that `column_name` annotations always take precedence over placeholder names.
Limitations
- Flexible column names are not supported withexternal tables.
NULL values
Note that for load jobs, BigQuery ignores NULL elements for the list compound type, since otherwise they would be translated to NULL ARRAY elements, which cannot persist to a table (see Data Types for details).
For more information on ORC data types, see the Apache ORC™ Specification v1.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.