Loading ORC data from Cloud Storage
This page provides an overview of loading ORC data from Cloud Storage into BigQuery.
ORC is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem.
When you load ORC data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
For information about loading ORC data from a local file, see Loading data into BigQuery from a local data source.
Limitations
You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:
- BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
- BigQuery doesn't support Cloud Storage object versioning. If you include a generation number in the Cloud Storage URI, then the load job fails.
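The generation-number limitation can be guarded against in client code before a load job is submitted. The following sketch is illustrative only; the helper name is hypothetical and not part of any Google client library. It strips a "#&lt;generation&gt;" suffix from a Cloud Storage URI:

```python
# Hypothetical helper (not part of any Google client library): remove a
# "#<generation>" suffix from a Cloud Storage URI so that the load job
# does not fail on a generation-qualified object name.
def strip_generation(uri):
    """Return the URI without a trailing #<generation> suffix, if present."""
    base, sep, suffix = uri.partition("#")
    if sep and suffix.isdigit():
        return base
    return uri

print(strip_generation("gs://mybucket/mydata.orc#1579287380533984"))
# gs://mybucket/mydata.orc
```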
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document, and create a dataset to store your data.
Required permissions
To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.
Permissions to load data into BigQuery
To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:
- bigquery.tables.create
- bigquery.tables.updateData
- bigquery.tables.update
- bigquery.jobs.create
Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:
- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.admin (includes the bigquery.jobs.create permission)
- roles/bigquery.user (includes the bigquery.jobs.create permission)
- roles/bigquery.jobUser (includes the bigquery.jobs.create permission)
Additionally, if you have the bigquery.datasets.create permission, you can create and update tables using a load job in the datasets that you create.
For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Permissions to load data from Cloud Storage
To get the permissions that you need to load data from a Cloud Storage bucket, ask your administrator to grant you the Storage Admin (roles/storage.admin) IAM role on the bucket. For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to load data from a Cloud Storage bucket:
- storage.buckets.get
- storage.objects.get
- storage.objects.list (required if you are using a URI wildcard)
You might also be able to get these permissions with custom roles or other predefined roles.
Create a dataset
Create a BigQuery dataset to store your data.
ORC schemas
When you load ORC files into BigQuery, the table schema is automatically retrieved from the self-describing source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.
For example, you have the following ORC files in Cloud Storage:

```
gs://mybucket/00/
  a.orc
  z.orc
gs://mybucket/01/
  b.orc
```

Running this command in the bq command-line tool loads all of the files (as a comma-separated list), and the schema is derived from mybucket/01/b.orc:

```
bq load \
--source_format=ORC \
dataset.table \
"gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"
```
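The "alphabetically last file" rule can be sketched in plain Python. This is only an illustration of the ordering, not a BigQuery API call; the function name is hypothetical, and the file names are the ones from the example above:

```python
# Illustration of the schema-selection rule: among the objects matched by the
# load job's URIs, BigQuery reads the schema from the alphabetically last one.
def schema_source(object_names):
    """Return the object name whose schema would be used for autodetection."""
    return max(object_names)

matched = ["mybucket/00/a.orc", "mybucket/00/z.orc", "mybucket/01/b.orc"]
print(schema_source(matched))  # mybucket/01/b.orc
```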
When BigQuery detects the schema, some ORC data types are converted to BigQuery data types to make them compatible with GoogleSQL syntax. All fields in the detected schema are NULLABLE. For more information, see ORC conversions.
When you load multiple ORC files that have different schemas, identical fields (with the same name and same nested level) specified in multiple schemas must map to the same converted BigQuery data type in each schema definition.
To provide a table schema for creating external tables, set the referenceFileSchemaUri property in the BigQuery API or the --reference_file_schema_uri parameter in the bq command-line tool to the URL of the reference file. For example, --reference_file_schema_uri="gs://mybucket/schema.orc".
ORC compression
BigQuery supports the following compression codecs for ORC file contents:
- Zlib
- Snappy
- LZO
- LZ4
- ZSTD
Data in ORC files doesn't remain compressed after it is uploaded to BigQuery. Data storage is reported in logical bytes or physical bytes, depending on the dataset storage billing model. To get information on storage usage, query the INFORMATION_SCHEMA.TABLE_STORAGE view.
Loading ORC data into a new table
You can load ORC data into a new table by:
- Using the Google Cloud console
- Using the bq command-line tool's bq load command
- Calling the jobs.insert API method and configuring a load job
- Using the client libraries
To load ORC data from Cloud Storage into a new BigQuery table:
Console
In the Google Cloud console, go to the BigQuery page.
- In the left pane, click Explorer.
- In the Explorer pane, expand your project, click Datasets, and then select a dataset.
- In the Dataset info section, click Create table.
- In the Create table pane, specify the following details:
- In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
  - Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
  - For File format, select ORC.
- In the Destination section, specify the following details:
  - For Dataset, select the dataset in which you want to create the table.
  - In the Table field, enter the name of the table that you want to create.
  - Verify that the Table type field is set to Native table.
- In the Schema section, no action is necessary. The schema is self-described in ORC files.
- Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables.
- Click Advanced options and do the following:
  - For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
  - If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
  - For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
- Click Create table.
After the table is created, you can update the table's expiration, description, and labels, but you cannot add a partition expiration after a table is created using the Google Cloud console. For more information, see Managing tables.
SQL
Use the LOAD DATA DDL statement. The following example loads an ORC file into the new table mytable:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
```sql
LOAD DATA OVERWRITE mydataset.mytable
FROM FILES (
  format = 'ORC',
  uris = ['gs://bucket/path/file.orc']);
```
Click Run.
For more information about how to run queries, see Run an interactive query.
bq
Use the bq load command, specify ORC as the source_format, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
- --time_partitioning_type: Enables time-based partitioning on a table and sets the partition type. Possible values are HOUR, DAY, MONTH, and YEAR. This flag is optional when you create a table partitioned on a DATE, DATETIME, or TIMESTAMP column. The default partition type for time-based partitioning is DAY. You cannot change the partitioning specification on an existing table.
- --time_partitioning_expiration: An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.
- --time_partitioning_field: The DATE or TIMESTAMP column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.
- --require_partition_filter: When enabled, this option requires users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Require a partition filter in queries.
- --clustering_fields: A comma-separated list of up to four column names used to create a clustered table.
- --destination_kms_key: The Cloud KMS key for encryption of the table data.

For more information about partitioned tables, see:
For more information about clustered tables, see:
For more information about table encryption, see:
To load ORC data into BigQuery, enter the following command:
```
bq --location=location load \
--source_format=format \
dataset.table \
path_to_source
```
Where:
- location is your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
- format is ORC.
- dataset is an existing dataset.
- table is the name of the table into which you're loading data.
- path_to_source is a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from gs://mybucket/mydata.orc into a table named mytable in mydataset.

```
bq load \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc
```

The following command loads data from gs://mybucket/mydata.orc into a new ingestion-time partitioned table named mytable in mydataset.

```
bq load \
--source_format=ORC \
--time_partitioning_type=DAY \
mydataset.mytable \
gs://mybucket/mydata.orc
```

The following command loads data from gs://mybucket/mydata.orc into a partitioned table named mytable in mydataset. The table is partitioned on the mytimestamp column.

```
bq load \
--source_format=ORC \
--time_partitioning_field mytimestamp \
mydataset.mytable \
gs://mybucket/mydata.orc
```

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard.

```
bq load \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata*.orc
```

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards.

```
bq load --autodetect \
--source_format=ORC \
mydataset.mytable \
"gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"
```

API
Create a load job that points to the source data in Cloud Storage.
(Optional) Specify your location in the location property in the jobReference section of the job resource.
The source URIs property must be fully qualified, in the format gs://bucket/object. Each URI can contain one '*' wildcard character.
Specify the ORC data format by setting the sourceFormat property to ORC.
To check the job status, call jobs.get(job_id), where job_id is the ID of the job returned by the initial request.
- If status.state = DONE, the job completed successfully.
- If the status.errorResult property is present, the request failed, and that object includes information describing what went wrong. When a request fails, no table is created and no data is loaded.
- If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.
API notes:
Load jobs are atomic and consistent; if a load job fails, none of the datais available, and if a load job succeeds, all of the data is available.
- As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.
- Calling jobs.insert on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations succeeds.
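The unique-job-ID practice can be sketched as follows. The helper name is hypothetical, and the commented-out load call assumes the google-cloud-bigquery Python client (which accepts a job_id argument on load_table_from_uri):

```python
import uuid

# Hypothetical helper: build a unique, retry-safe job ID to pass as
# jobReference.jobId when calling jobs.insert.
def make_job_id(prefix="orc_load"):
    return "{}_{}".format(prefix, uuid.uuid4().hex)

job_id = make_job_id()
# Retrying jobs.insert with this same job_id is idempotent, e.g.:
# client.load_table_from_uri(uri, table_id, job_id=job_id, job_config=job_config)
print(job_id)
```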
C#
Before trying this sample, follow the C# setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery C# API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```csharp
using Google.Apis.Bigquery.v2.Data;
using Google.Cloud.BigQuery.V2;
using System;

public class BigQueryLoadTableGcsOrc
{
    public void LoadTableGcsOrc(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id")
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
        var dataset = client.GetDataset(datasetId);
        TableReference destinationTableRef = dataset.GetTableReference(tableId: "us_states");
        // Create job configuration
        var jobOptions = new CreateLoadJobOptions()
        {
            SourceFormat = FileFormat.Orc
        };
        // Create and run job
        var loadJob = client.CreateLoadJob(
            sourceUri: gcsURI,
            destination: destinationTableRef,
            // Pass null as the schema because the schema is inferred when
            // loading Orc data
            schema: null,
            options: jobOptions);
        loadJob = loadJob.PollUntilCompleted().ThrowOnAnyError();  // Waits for the job to complete.
        // Display the number of rows uploaded
        BigQueryTable table = client.GetTable(destinationTableRef);
        Console.WriteLine($"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
    }
}
```
Go
Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```go
import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importORC demonstrates loading Apache ORC data from Cloud Storage into a table.
func importORC(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.orc")
	gcsRef.SourceFormat = bigquery.ORC
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
```
Java
```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;

// Sample to load ORC data from Cloud Storage into a new BigQuery table
public class LoadOrcFromGCS {

  public static void runLoadOrcFromGCS() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
    Schema schema =
        Schema.of(
            Field.of("name", StandardSQLTypeName.STRING),
            Field.of("post_abbr", StandardSQLTypeName.STRING));
    loadOrcFromGCS(datasetName, tableName, sourceUri, schema);
  }

  public static void loadOrcFromGCS(
      String datasetName, String tableName, String sourceUri, Schema schema) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
      TableId tableId = TableId.of(datasetName, tableName);
      LoadJobConfiguration loadConfig =
          LoadJobConfiguration.newBuilder(tableId, sourceUri, FormatOptions.orc())
              .setSchema(schema)
              .build();
      // Load data from a GCS ORC file into the table
      Job job = bigquery.create(JobInfo.of(loadConfig));
      // Blocks until this load table job completes its execution, either failing or succeeding.
      job = job.waitFor();
      if (job.isDone() && job.getStatus().getError() == null) {
        System.out.println("ORC from GCS successfully added during load append job");
      } else {
        System.out.println(
            "BigQuery was unable to load into the table due to an error:"
                + job.getStatus().getError());
      }
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Column not added during load append \n" + e.toString());
    }
  }
}
```
Node.js
Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```javascript
// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the ORC file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.orc
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.orc';

async function loadTableGCSORC() {
  // Imports a GCS file into a table with ORC source format.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'ORC',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);
  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
```
PHP
Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```php
use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.orc';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('ORC');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
```
Python
Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```python
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.ORC)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
```
Ruby
Before trying this sample, follow the Ruby setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Ruby API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```ruby
require "google/cloud/bigquery"

def load_table_gcs_orc dataset_id = "your_dataset_id"
  bigquery = Google::Cloud::Bigquery.new
  dataset = bigquery.dataset dataset_id
  gcs_uri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"
  table_id = "us_states"
  load_job = dataset.load_job table_id, gcs_uri, format: "orc"
  puts "Starting job #{load_job.job_id}"

  load_job.wait_until_done! # Waits for table load to complete.
  puts "Job finished."

  table = dataset.table table_id
  puts "Loaded #{table.rows_count} rows to table #{table.id}"
end
```
Append to or overwrite a table with ORC data
You can load additional data into a table either from source files or byappending query results.
In the Google Cloud console, use theWrite preference option to specifywhat action to take when you load data from a source file or from a queryresult.
You have the following options when you load additional data into a table:
| Console option | bq tool flag | BigQuery API property | Description |
|---|---|---|---|
| Write if empty | Not supported | WRITE_EMPTY | Writes the data only if the table is empty. |
| Append to table | --noreplace or--replace=false; if--[no]replace is unspecified, the default is append | WRITE_APPEND | (Default) Appends the data to the end of the table. |
| Overwrite table | --replace or--replace=true | WRITE_TRUNCATE | Erases all existing data in a table before writing the new data. This action also deletes the table schema, row level security, and removes any Cloud KMS key. |
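As a sketch of how the console options in the table above map onto the REST API, the following builds a minimal jobs.insert request body for an ORC load. The helper name is hypothetical; only the fields shown are included, and the real JobConfigurationLoad resource accepts many more:

```python
# Hypothetical helper: build a minimal jobs.insert request body for an ORC
# load, mapping the write preference onto configuration.load.writeDisposition.
def build_load_request(uri, project, dataset, table, write_disposition="WRITE_APPEND"):
    assert write_disposition in ("WRITE_EMPTY", "WRITE_APPEND", "WRITE_TRUNCATE")
    return {
        "configuration": {
            "load": {
                "sourceUris": [uri],
                "sourceFormat": "ORC",
                "writeDisposition": write_disposition,
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
            }
        }
    }

body = build_load_request(
    "gs://mybucket/mydata.orc", "myproject", "mydataset", "mytable", "WRITE_TRUNCATE"
)
print(body["configuration"]["load"]["writeDisposition"])  # WRITE_TRUNCATE
```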
If you load data into an existing table, the load job can append the data or overwrite the table.
You can append or overwrite a table by:
- Using the Google Cloud console
- Using the bq command-line tool's bq load command
- Calling the jobs.insert API method and configuring a load job
- Using the client libraries
To append or overwrite a table with ORC data:
Console
In the Google Cloud console, go to the BigQuery page.
- In the left pane, click Explorer.
- In the Explorer pane, expand your project, click Datasets, and then select a dataset.
- In the Dataset info section, click Create table.
- In the Create table pane, specify the following details:
- In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
  - Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
  - For File format, select ORC.
- In the Destination section, specify the following details:
  - For Dataset, select the dataset in which you want to create the table.
  - In the Table field, enter the name of the table that you want to create.
  - Verify that the Table type field is set to Native table.
- In the Schema section, no action is necessary. The schema is self-described in ORC files. Note: It is possible to modify the table's schema when you append or overwrite it. For more information about supported schema changes during a load operation, see Modifying table schemas.
- Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables. You cannot convert a table to a partitioned or clustered table by appending or overwriting it. The Google Cloud console does not support appending to or overwriting partitioned or clustered tables in a load job.
- Click Advanced options and do the following:
  - For Write preference, choose Append to table or Overwrite table.
  - If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
  - For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
- Click Create table.
SQL
Use the LOAD DATA DDL statement. The following example appends an ORC file to the table mytable:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
```sql
LOAD DATA INTO mydataset.mytable
FROM FILES (
  format = 'ORC',
  uris = ['gs://bucket/path/file.orc']);
```
Click Run.
For more information about how to run queries, see Run an interactive query.
bq
Enter the bq load command with the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data. Supply the --source_format flag and set it to ORC. Because ORC schemas are automatically retrieved from the self-describing source data, you don't need to provide a schema definition.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
--destination_kms_key: The Cloud KMS key for encryption of thetable data.
```
bq --location=location load \
--[no]replace \
--source_format=format \
dataset.table \
path_to_source
```
Where:
- location is your location. The --location flag is optional. You can set a default value for the location by using the .bigqueryrc file.
- format is ORC.
- dataset is an existing dataset.
- table is the name of the table into which you're loading data.
- path_to_source is a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from gs://mybucket/mydata.orc and overwrites a table named mytable in mydataset.

```
bq load \
--replace \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc
```

The following command loads data from gs://mybucket/mydata.orc and appends data to a table named mytable in mydataset.

```
bq load \
--noreplace \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc
```

For information about appending and overwriting partitioned tables using the bq command-line tool, see Appending to and overwriting partitioned table data.
API
Create a load job that points to the source data in Cloud Storage.
(Optional) Specify your location in the location property in the jobReference section of the job resource.
The source URIs property must be fully qualified, in the format gs://bucket/object. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported.
Specify the data format by setting the configuration.load.sourceFormat property to ORC.
Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE or WRITE_APPEND.
C#
Before trying this sample, follow the C# setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery C# API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```csharp
using Google.Apis.Bigquery.v2.Data;
using Google.Cloud.BigQuery.V2;
using System;

public class BigQueryLoadTableGcsOrcTruncate
{
    public void LoadTableGcsOrcTruncate(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id",
        string tableId = "your_table_id")
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
        var dataset = client.GetDataset(datasetId);
        TableReference destinationTableRef = dataset.GetTableReference(tableId: "us_states");
        // Create job configuration
        var jobOptions = new CreateLoadJobOptions()
        {
            SourceFormat = FileFormat.Orc,
            WriteDisposition = WriteDisposition.WriteTruncate
        };
        // Create and run job
        var loadJob = client.CreateLoadJob(
            sourceUri: gcsURI,
            destination: destinationTableRef,
            // Pass null as the schema because the schema is inferred when
            // loading Orc data
            schema: null,
            options: jobOptions);
        loadJob = loadJob.PollUntilCompleted().ThrowOnAnyError();  // Waits for the job to complete.
        // Display the number of rows uploaded
        BigQueryTable table = client.GetTable(destinationTableRef);
        Console.WriteLine($"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
    }
}
```
Go
Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```go
import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importORCTruncate demonstrates loading Apache ORC data from Cloud Storage into a table
// and overwriting/truncating existing data in the table.
func importORCTruncate(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.orc")
	gcsRef.SourceFormat = bigquery.ORC
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
	// Default for import jobs is to append data to a table. WriteTruncate
	// specifies that existing data should instead be replaced/overwritten.
	loader.WriteDisposition = bigquery.WriteTruncate

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
```
Java
```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

// Sample to overwrite the BigQuery table data by loading a ORC file from GCS
public class LoadOrcFromGcsTruncate {

  public static void runLoadOrcFromGcsTruncate() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
    loadOrcFromGcsTruncate(datasetName, tableName, sourceUri);
  }

  public static void loadOrcFromGcsTruncate(
      String datasetName, String tableName, String sourceUri) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
      TableId tableId = TableId.of(datasetName, tableName);
      LoadJobConfiguration loadConfig =
          LoadJobConfiguration.newBuilder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.orc())
              // Set the write disposition to overwrite existing table data
              .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
              .build();
      // Load data from a GCS ORC file into the table
      Job job = bigquery.create(JobInfo.of(loadConfig));
      // Blocks until this load table job completes its execution, either failing or succeeding.
      job = job.waitFor();
      if (job.isDone() && job.getStatus().getError() == null) {
        System.out.println("Table is successfully overwritten by ORC file loaded from GCS");
      } else {
        System.out.println(
            "BigQuery was unable to load into the table due to an error:"
                + job.getStatus().getError());
      }
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Column not added during load append \n" + e.toString());
    }
  }
}
```
Node.js
Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```javascript
// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate the clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the ORC file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.orc
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.orc';

async function loadORCFromGCSTruncate() {
  /**
   * Imports a GCS file into a table and overwrites
   * table data if table already exists.
   */

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = "my_dataset";
  // const tableId = "my_table";

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'ORC',
    // Set the write disposition to overwrite existing table data.
    writeDisposition: 'WRITE_TRUNCATE',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
```
PHP
Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```php
use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';
// $tableId = 'The BigQuery table ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$table = $bigQuery->dataset($datasetId)->table($tableId);

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.orc';
$loadConfig = $table->loadFromStorage($gcsUri)
    ->sourceFormat('ORC')
    ->writeDisposition('WRITE_TRUNCATE');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
```
Python
Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```python
import io

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
)

body = io.BytesIO(b"Washington,WA")
client.load_table_from_file(body, table_id, job_config=job_config).result()
previous_rows = client.get_table(table_id).num_rows
assert previous_rows > 0

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.ORC,
)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"
load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
```
Ruby
Before trying this sample, follow the Ruby setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Ruby API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

```ruby
require "google/cloud/bigquery"

def load_table_gcs_orc_truncate dataset_id = "your_dataset_id",
                                table_id   = "your_table_id"
  bigquery = Google::Cloud::Bigquery.new
  dataset  = bigquery.dataset dataset_id
  gcs_uri  = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"

  load_job = dataset.load_job table_id,
                              gcs_uri,
                              format: "orc",
                              write:  "truncate"
  puts "Starting job #{load_job.job_id}"

  load_job.wait_until_done! # Waits for table load to complete.
  puts "Job finished."

  table = dataset.table table_id
  puts "Loaded #{table.rows_count} rows to table #{table.id}"
end
```
Load hive-partitioned ORC data
BigQuery supports loading hive-partitioned ORC data stored on Cloud Storage and populates the hive partitioning columns as columns in the destination BigQuery managed table. For more information, see Loading externally partitioned data from Cloud Storage.
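As an illustrative sketch, a load job for hive-partitioned ORC data can declare `hivePartitioningOptions` in its `JobConfigurationLoad`. The bucket, project, dataset, and table names below are hypothetical placeholders:

```json
{
  "configuration": {
    "load": {
      "sourceUris": ["gs://my-bucket/orc-data/*"],
      "sourceFormat": "ORC",
      "hivePartitioningOptions": {
        "mode": "AUTO",
        "sourceUriPrefix": "gs://my-bucket/orc-data/"
      },
      "destinationTable": {
        "projectId": "my-project",
        "datasetId": "mydataset",
        "tableId": "mytable"
      }
    }
  }
}
```

With `"mode": "AUTO"`, partition keys encoded in the object paths (for example, `dt=2024-01-01/`) are detected automatically and become columns in the destination table.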
ORC conversions
BigQuery converts ORC data types to the following BigQuery data types:
Primitive types
| ORC data type | BigQuery data type | Notes |
|---|---|---|
| boolean | BOOLEAN | |
| byte | INTEGER | |
| short | INTEGER | |
| int | INTEGER | |
| long | INTEGER | |
| float | FLOAT | |
| double | FLOAT | |
| string | STRING | UTF-8 only |
| varchar | STRING | UTF-8 only |
| char | STRING | UTF-8 only |
| binary | BYTES | |
| date | DATE | An attempt to convert any value in the ORC data that is less than -719162 days or greater than 2932896 days returns an invalid date value error. If this affects you, contact Support to have unsupported values converted to the BigQuery minimum value of 0001-01-01 or maximum value of 9999-12-31, as appropriate. |
| timestamp | TIMESTAMP | ORC supports nanosecond precision, but BigQuery converts sub-microsecond values to microseconds when the data is read. An attempt to convert any value in the ORC data that is less than -719162 days or greater than 2932896 days returns an invalid date value error. |
| decimal | NUMERIC, BIGNUMERIC, or STRING | See Decimal type. |
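The nanosecond-to-microsecond conversion for `timestamp` values can be illustrated with a small sketch. The truncation of sub-microsecond digits shown here is an assumption about the conversion, and the helper name is hypothetical, not code from the BigQuery client:

```python
def orc_nanos_to_bq_micros(epoch_nanos: int) -> int:
    """Drop sub-microsecond digits from an ORC nanosecond timestamp.

    BigQuery TIMESTAMP has microsecond precision, so the final three
    decimal digits of a nanosecond epoch value cannot be represented.
    """
    return epoch_nanos // 1_000


# 2023-11-14T22:13:20.123456789Z, as nanoseconds since the Unix epoch:
nanos = 1_700_000_000_123_456_789
print(orc_nanos_to_bq_micros(nanos))  # the trailing 789 ns are lost
```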
Decimal type
Decimal logical types can be converted to NUMERIC, BIGNUMERIC, or STRING types. The converted type depends on the precision and scale parameters of the decimal logical type and the specified decimal target types. Specify the decimal target type as follows:
- For a load job using the `jobs.insert` API: use the `JobConfigurationLoad.decimalTargetTypes` field.
- For a load job using the `bq load` command in the bq command-line tool: use the `--decimal_target_types` flag.
- For a query against a table with external sources: use the `ExternalDataConfiguration.decimalTargetTypes` field.
- For a persistent external table created with DDL: use the `decimal_target_types` option.
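The selection among the listed target types can be sketched in a few lines of Python. This is a simplified model: the precision/scale bounds below are taken from BigQuery's parameterized NUMERIC and BIGNUMERIC limits, the helper name is hypothetical, and this is not the service's actual implementation:

```python
def pick_decimal_target(precision: int, scale: int, target_types: list) -> str:
    """Return the narrowest listed type that can hold decimal(precision, scale)."""

    def fits(t: str) -> bool:
        if t == "NUMERIC":
            # NUMERIC: up to 9 digits of scale and 29 integer digits.
            return scale <= 9 and precision - scale <= 29
        if t == "BIGNUMERIC":
            # BIGNUMERIC: up to 38 digits of scale and 38 integer digits.
            return scale <= 38 and precision - scale <= 38
        return t == "STRING"  # STRING can hold any decimal as text.

    for t in ("NUMERIC", "BIGNUMERIC", "STRING"):  # narrowest type first
        if t in target_types and fits(t):
            return t
    raise ValueError("no listed decimal target type fits this precision/scale")


print(pick_decimal_target(38, 9, ["NUMERIC", "BIGNUMERIC"]))   # NUMERIC
print(pick_decimal_target(40, 10, ["NUMERIC", "BIGNUMERIC"]))  # BIGNUMERIC
```

For example, a `decimal(77, 38)` column exceeds both NUMERIC and BIGNUMERIC bounds, so it is kept only if STRING is among the listed target types.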
Complex types
| ORC data type | BigQuery data type | Notes |
|---|---|---|
| struct | RECORD | |
| map<K,V> | RECORD | An ORC map<K,V> field is converted to a repeated RECORD that contains two fields: a key of the same data type as K, and a value of the same data type as V. Both fields are NULLABLE. |
| list | repeated fields | Nested lists and lists of maps are not supported. |
| union | RECORD | |
Column names
A column name can contain letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. If you use flexible column names, BigQuery supports starting a column name with a number. Exercise caution when starting column names with a number, because using flexible column names with the BigQuery Storage Read API or BigQuery Storage Write API requires special handling. For more information about flexible column name support, see Flexible column names.
Column names have a maximum length of 300 characters. Column names can't use any of the following prefixes:

- `_TABLE_`
- `_FILE_`
- `_PARTITION`
- `_ROW_TIMESTAMP`
- `__ROOT__`
- `_COLIDENTIFIER`
- `_CHANGE_SEQUENCE_NUMBER`
- `_CHANGE_TYPE`
- `_CHANGE_TIMESTAMP`
Duplicate column names are not allowed even if the case differs. For example, a column named `Column1` is considered identical to a column named `column1`. To learn more about column naming rules, see Column names in the GoogleSQL reference.
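The rules above for classic (non-flexible) column names can be expressed as a short validation sketch. The helper name is hypothetical, and flexible names (which may start with a number) are deliberately out of scope here:

```python
import re

# Letters, digits, underscores; must start with a letter or underscore.
_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
# Reserved prefixes from the list above.
_RESERVED_PREFIXES = (
    "_TABLE_", "_FILE_", "_PARTITION", "_ROW_TIMESTAMP", "__ROOT__",
    "_COLIDENTIFIER", "_CHANGE_SEQUENCE_NUMBER", "_CHANGE_TYPE",
    "_CHANGE_TIMESTAMP",
)


def validate_column_names(names):
    seen = set()
    for name in names:
        if len(name) > 300 or not _NAME_RE.match(name):
            raise ValueError(f"invalid column name: {name!r}")
        if name.upper().startswith(_RESERVED_PREFIXES):
            raise ValueError(f"reserved prefix in column name: {name!r}")
        lowered = name.lower()  # duplicates are detected case-insensitively
        if lowered in seen:
            raise ValueError(f"duplicate column name: {name!r}")
        seen.add(lowered)
    return True


print(validate_column_names(["name", "post_abbr"]))  # True
```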
If a table name (for example, `test`) is the same as one of its column names (for example, `test`), the SELECT expression interprets the `test` column as a STRUCT containing all other table columns. To avoid this collision, use one of the following methods:
- Avoid using the same name for a table and its columns.

- Avoid using `_field_` as a column name prefix. System-reserved prefixes cause automatic renaming during queries. For example, the `SELECT _field_ FROM project1.dataset.test` query returns a column named `_field_1`. If you must query a column with this name, use an alias to control the output.

- Assign the table a different alias. For example, the following query assigns a table alias `t` to the table `project1.dataset.test`: `SELECT test FROM project1.dataset.test AS t;`

- Include the table name when referencing a column. For example: `SELECT test.test FROM project1.dataset.test;`
Flexible column names
You have more flexibility in what you name columns, including expanded access to characters in languages other than English as well as additional symbols. Make sure to use backtick (`) characters to enclose flexible column names if they are quoted identifiers.
Flexible column names support the following characters:
- Any letter in any language, as represented by the Unicode regular expression `\p{L}`.
- Any numeric character in any language, as represented by the Unicode regular expression `\p{N}`.
- Any connector punctuation character, including underscores, as represented by the Unicode regular expression `\p{Pc}`.
- A hyphen or dash, as represented by the Unicode regular expression `\p{Pd}`.
- Any mark intended to accompany another character, as represented by the Unicode regular expression `\p{M}`. For example, accents, umlauts, or enclosing boxes.
- The following special characters:
  - An ampersand (`&`), as represented by the Unicode regular expression `\u0026`.
  - A percent sign (`%`), as represented by the Unicode regular expression `\u0025`.
  - An equals sign (`=`), as represented by the Unicode regular expression `\u003D`.
  - A plus sign (`+`), as represented by the Unicode regular expression `\u002B`.
  - A colon (`:`), as represented by the Unicode regular expression `\u003A`.
  - An apostrophe (`'`), as represented by the Unicode regular expression `\u0027`.
  - A less-than sign (`<`), as represented by the Unicode regular expression `\u003C`.
  - A greater-than sign (`>`), as represented by the Unicode regular expression `\u003E`.
  - A number sign (`#`), as represented by the Unicode regular expression `\u0023`.
  - A vertical line (`|`), as represented by the Unicode regular expression `\u007C`.
  - Whitespace.
Flexible column names don't support the following special characters:
- An exclamation mark (`!`), as represented by the Unicode regular expression `\u0021`.
- A quotation mark (`"`), as represented by the Unicode regular expression `\u0022`.
- A dollar sign (`$`), as represented by the Unicode regular expression `\u0024`.
- A left parenthesis (`(`), as represented by the Unicode regular expression `\u0028`.
- A right parenthesis (`)`), as represented by the Unicode regular expression `\u0029`.
- An asterisk (`*`), as represented by the Unicode regular expression `\u002A`.
- A comma (`,`), as represented by the Unicode regular expression `\u002C`.
- A period (`.`), as represented by the Unicode regular expression `\u002E`. Periods are not replaced by underscores in Parquet file column names when a column name character map is used. For more information, see flexible column limitations.
- A slash (`/`), as represented by the Unicode regular expression `\u002F`.
- A semicolon (`;`), as represented by the Unicode regular expression `\u003B`.
- A question mark (`?`), as represented by the Unicode regular expression `\u003F`.
- An at sign (`@`), as represented by the Unicode regular expression `\u0040`.
- A left square bracket (`[`), as represented by the Unicode regular expression `\u005B`.
- A backslash (`\`), as represented by the Unicode regular expression `\u005C`.
- A right square bracket (`]`), as represented by the Unicode regular expression `\u005D`.
- A circumflex accent (`^`), as represented by the Unicode regular expression `\u005E`.
- A grave accent (`` ` ``), as represented by the Unicode regular expression `\u0060`.
- A left curly bracket (`{`), as represented by the Unicode regular expression `\u007B`.
- A right curly bracket (`}`), as represented by the Unicode regular expression `\u007D`.
- A tilde (`~`), as represented by the Unicode regular expression `\u007E`.
For additional guidelines, seeColumn names.
The expanded column characters are supported by both the BigQuery Storage Read API and the BigQuery Storage Write API. To use the expanded list of Unicode characters with the BigQuery Storage Read API, you must set a flag. You can use the `displayName` attribute to retrieve the column name. The following example shows how to set a flag with the Python client:
```python
from google.cloud.bigquery_storage import types

requested_session = types.ReadSession()

# Set Avro serialization options for flexible column names.
options = types.AvroSerializationOptions()
options.enable_display_name_attribute = True
requested_session.read_options.avro_serialization_options = options
```

To use the expanded list of Unicode characters with the BigQuery Storage Write API, you must provide the schema with `column_name` notation, unless you are using the `JsonStreamWriter` writer object. The following example shows how to provide the schema:
```proto
syntax = "proto2";

package mypackage;

// Source protos located in github.com/googleapis/googleapis
import "google/cloud/bigquery/storage/v1/annotations.proto";

message FlexibleSchema {
  optional string item_name_column = 1
      [(.google.cloud.bigquery.storage.v1.column_name) = "name-列"];
  optional string item_description_column = 2
      [(.google.cloud.bigquery.storage.v1.column_name) = "description-列"];
}
```

In this example, `item_name_column` and `item_description_column` are placeholder names that need to be compliant with the protocol buffer naming convention. Note that `column_name` annotations always take precedence over placeholder names.
Limitations
- Flexible column names are not supported withexternal tables.
NULL values
Note that for load jobs, BigQuery ignores NULL elements for the list compound type, since otherwise they would be translated to NULL ARRAY elements, which cannot persist to a table (see Data Types for details).
For more information on ORC data types, see the Apache ORC™ Specification v1.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.