Loading Parquet data from Cloud Storage
This page provides an overview of loading Parquet data from Cloud Storage into BigQuery.
Parquet is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem.
When you load Parquet data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
For information about loading Parquet data from a local file, see Loading data from local files.
Limitations
You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:
- BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
- BigQuery doesn't support Cloud Storage object versioning. If you include a generation number in the Cloud Storage URI, then the load job fails.
- You can't use a wildcard in the Cloud Storage URI if any of the files to be loaded have different schemas. Any difference in the position of columns qualifies as a different schema.
Input file requirements
To avoid resourcesExceeded errors when loading Parquet files into BigQuery, follow these guidelines:
- Keep row sizes to 50 MB or less.
- If your input data contains more than 100 columns, consider reducing the page size to be smaller than the default page size (1 * 1024 * 1024 bytes). This is especially helpful if you are using significant compression.
- For optimal performance, aim for row group sizes of at least 16 MiB. Smaller row group sizes increase I/O and slow down loads and queries.
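If you generate your Parquet files with PyArrow, a minimal sketch of tuning both knobs might look like the following (the file name and data are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Example table; replace with your own data.
table = pa.table({"id": list(range(1000)), "name": ["row"] * 1000})

# Write with a smaller data page size (default is 1 MiB) and a row group
# size large enough to keep row groups at roughly 16 MiB or more.
pq.write_table(
    table,
    "mydata.parquet",
    data_page_size=256 * 1024,  # 256 KiB pages for wide, highly compressed data
    row_group_size=500_000,     # rows per row group; tune so groups are >= 16 MiB
)
```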
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document, and create a dataset to store your data.
Required permissions
To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.
Permissions to load data into BigQuery
To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:
- bigquery.tables.create
- bigquery.tables.updateData
- bigquery.tables.update
- bigquery.jobs.create
Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:
- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.admin (includes the bigquery.jobs.create permission)
- bigquery.user (includes the bigquery.jobs.create permission)
- bigquery.jobUser (includes the bigquery.jobs.create permission)
Additionally, if you have the bigquery.datasets.create permission, you can create and update tables using a load job in the datasets that you create.
For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Permissions to load data from Cloud Storage
To get the permissions that you need to load data from a Cloud Storage bucket, ask your administrator to grant you the Storage Admin (roles/storage.admin) IAM role on the bucket. For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to load data from a Cloud Storage bucket:
- storage.buckets.get
- storage.objects.get
- storage.objects.list (required if you are using a URI wildcard)
You might also be able to get these permissions with custom roles or other predefined roles.
Create a dataset
Create a BigQuery dataset to store your data.
Parquet schemas
When you load Parquet files into BigQuery, the table schema is automatically retrieved from the self-describing source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.
For example, suppose you have the following Parquet files in Cloud Storage:

gs://mybucket/00/
  a.parquet
  z.parquet
gs://mybucket/01/
  b.parquet

Running this command in the bq command-line tool loads all of the files (as a comma-separated list), and the schema is derived from mybucket/01/b.parquet:

bq load \
    --source_format=PARQUET \
    dataset.table \
    "gs://mybucket/00/*.parquet","gs://mybucket/01/*.parquet"
When you load multiple Parquet files that have different schemas, identical columns specified in multiple schemas must have the same mode in each schema definition.
When BigQuery detects the schema, some Parquet data types are converted to BigQuery data types to make them compatible with GoogleSQL syntax. For more information, see Parquet conversions.
To provide a table schema for creating external tables, set the referenceFileSchemaUri property in the BigQuery API or the --reference_file_schema_uri parameter in the bq command-line tool to the URL of the reference file. For example, --reference_file_schema_uri="gs://mybucket/schema.parquet".
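As a rough sketch with the Python client library, assuming your google-cloud-bigquery version exposes ExternalConfig.reference_file_schema_uri (the table and bucket names below are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical external table and bucket names for illustration.
table_id = "your-project.your_dataset.your_external_table"

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://mybucket/data/*.parquet"]
# Point at a single Parquet file whose schema should be used for the table.
external_config.reference_file_schema_uri = "gs://mybucket/schema.parquet"

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
table = client.create_table(table)  # Makes an API request.
```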
Parquet compression
BigQuery supports the following compression codecs for Parquet file contents:
- GZip
- LZO_1C
- LZO_1X
- LZ4_RAW
- Snappy
- ZSTD
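For example, if you produce your Parquet files with PyArrow, you can choose one of these codecs at write time (the file names and data are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"state": ["WA", "OR"], "population": [7_705_281, 4_237_256]})

# Snappy is the PyArrow default; ZSTD usually compresses better at some CPU cost.
pq.write_table(table, "us_states_snappy.parquet", compression="snappy")
pq.write_table(table, "us_states_zstd.parquet", compression="zstd")
```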
Loading Parquet data into a new table
You can load Parquet data into a new table by using one of the following:
- The Google Cloud console
- The bq command-line tool's bq load command
- The jobs.insert API method and configuring a load job
- The client libraries

To load Parquet data from Cloud Storage into a new BigQuery table:
Console
In the Google Cloud console, go to the BigQuery page.
- In the left pane, click Explorer.
- In the Explorer pane, expand your project, click Datasets, and then select a dataset.
- In the Dataset info section, click Create table.
- In the Create table pane, specify the following details:
  - In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
    - Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
    - For File format, select Parquet.
- In the Destination section, specify the following details:
  - For Dataset, select the dataset in which you want to create the table.
  - In the Table field, enter the name of the table that you want to create.
  - Verify that the Table type field is set to Native table.
- In the Schema section, no action is necessary. The schema is self-described in Parquet files.
- Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables.
- Click Advanced options and do the following:
  - For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
  - If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
  - For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
- Click Create table.

After the table is created, you can update the table's expiration, description, and labels, but you cannot add a partition expiration after a table is created using the Google Cloud console. For more information, see Managing tables.
SQL
Use the LOAD DATA DDL statement. The following example loads a Parquet file into the new table mytable:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:

LOAD DATA OVERWRITE mydataset.mytable
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://bucket/path/file.parquet']);

Click Run.
For more information about how to run queries, see Run an interactive query.
bq
Use the bq load command, specify PARQUET using the --source_format flag, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
- --time_partitioning_type: Enables time-based partitioning on a table and sets the partition type. Possible values are HOUR, DAY, MONTH, and YEAR. This flag is optional when you create a table partitioned on a DATE, DATETIME, or TIMESTAMP column. The default partition type for time-based partitioning is DAY. You cannot change the partitioning specification on an existing table.
- --time_partitioning_expiration: An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.
- --time_partitioning_field: The DATE or TIMESTAMP column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.
- --require_partition_filter: When enabled, this option requires users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter can reduce cost and improve performance. For more information, see Require a partition filter in queries.
- --clustering_fields: A comma-separated list of up to four column names used to create a clustered table.
- --destination_kms_key: The Cloud KMS key for encryption of the table data.
- --column_name_character_map: Defines the scope and handling of characters in column names, with the option of enabling flexible column names. For more information, see load_option_list. For more information on supported and unsupported characters, see flexible column names.

For more information, see the documentation on partitioned tables, clustered tables, and table encryption.
To load Parquet data into BigQuery, enter the following command:
bq --location=LOCATION load \
    --source_format=FORMAT \
    DATASET.TABLE \
    PATH_TO_SOURCE
Replace the following:
- LOCATION: your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
- FORMAT: PARQUET.
- DATASET: an existing dataset.
- TABLE: the name of the table into which you're loading data.
- PATH_TO_SOURCE: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from gs://mybucket/mydata.parquet into a table named mytable in mydataset.

bq load \
    --source_format=PARQUET \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

The following command loads data from gs://mybucket/mydata.parquet into a new ingestion-time partitioned table named mytable in mydataset.

bq load \
    --source_format=PARQUET \
    --time_partitioning_type=DAY \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

The following command loads data from gs://mybucket/mydata.parquet into a partitioned table named mytable in mydataset. The table is partitioned on the mytimestamp column.

bq load \
    --source_format=PARQUET \
    --time_partitioning_field mytimestamp \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard.

bq load \
    --source_format=PARQUET \
    mydataset.mytable \
    gs://mybucket/mydata*.parquet

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards.

bq load \
    --source_format=PARQUET \
    mydataset.mytable \
    "gs://mybucket/00/*.parquet","gs://mybucket/01/*.parquet"

API
1. Create a load job that points to the source data in Cloud Storage.
2. (Optional) Specify your location in the location property in the jobReference section of the job resource.
3. The source URIs property must be fully qualified, in the format gs://BUCKET/OBJECT. Each URI can contain one '*' wildcard character.
4. Specify the Parquet data format by setting the sourceFormat property to PARQUET.
5. To check the job status, call jobs.get(JOB_ID*), replacing JOB_ID with the ID of the job returned by the initial request.
   - If status.state = DONE, the job completed successfully.
   - If the status.errorResult property is present, the request failed, and that object includes information describing what went wrong. When a request fails, no table is created and no data is loaded.
   - If status.errorResult is absent, the job finished successfully, although there might have been some nonfatal errors, such as problems importing a few rows. Nonfatal errors are listed in the returned job object's status.errors property.
API notes:
- Load jobs are atomic and consistent: if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.
- As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.
- Calling jobs.insert on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
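As a sketch of this practice with the Python client (the table ID and URI are placeholders), you can generate the job ID yourself and pass it to the load call so that a retry reuses the same job:

```python
import uuid

from google.cloud import bigquery

client = bigquery.Client()

table_id = "your-project.your_dataset.your_table_name"  # placeholder
uri = "gs://mybucket/mydata.parquet"  # placeholder

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# Generate the job ID once; reusing it on retry makes the insert idempotent.
job_id = f"load_parquet_{uuid.uuid4()}"

load_job = client.load_table_from_uri(
    uri, table_id, job_id=job_id, job_config=job_config
)
load_job.result()  # Waits for the job to complete.
```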
Go
Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importParquet demonstrates loading Apache Parquet data from Cloud Storage into a table.
func importParquet(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.parquet")
	gcsRef.SourceFormat = bigquery.Parquet
	gcsRef.AutoDetect = true
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.math.BigInteger;

public class LoadParquet {

  public static void runLoadParquet() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    loadParquet(datasetName);
  }

  public static void loadParquet(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet";
      TableId tableId = TableId.of(datasetName, "us_states");

      LoadJobConfiguration configuration =
          LoadJobConfiguration.builder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.parquet())
              .build();

      // For more information on Job see:
      // https://googleapis.dev/java/google-cloud-clients/latest/index.html?com/google/cloud/bigquery/package-summary.html
      // Load the table
      Job job = bigquery.create(JobInfo.of(configuration));

      // Blocks until this load table job completes its execution, either failing or succeeding.
      Job completedJob = job.waitFor();
      if (completedJob == null) {
        System.out.println("Job not executed since it no longer exists.");
        return;
      } else if (completedJob.getStatus().getError() != null) {
        System.out.println(
            "BigQuery was unable to load the table due to an error: \n" + job.getStatus().getError());
        return;
      }

      // Check number of rows loaded into the table
      BigInteger numRows = bigquery.getTable(tableId).getNumRows();
      System.out.printf("Loaded %d rows. \n", numRows);

      System.out.println("GCS parquet loaded successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("GCS Parquet was not loaded. \n" + e.toString());
    }
  }
}
Node.js
Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the Parquet file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.parquet
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.parquet';

async function loadTableGCSParquet() {
  // Imports a GCS file into a table with Parquet source format.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'PARQUET',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
PHP
Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('PARQUET');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
Python
Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries. Set the LoadJobConfig.source_format property to PARQUET and pass the job config as the job_config argument to the load_table_from_uri() method.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
Appending to or overwriting a table with Parquet data
You can load additional data into a table either from source files or by appending query results.
In the Google Cloud console, use the Write preference option to specify what action to take when you load data from a source file or from a query result.
You have the following options when you load additional data into a table:
| Console option | bq tool flag | BigQuery API property | Description |
|---|---|---|---|
| Write if empty | Not supported | WRITE_EMPTY | Writes the data only if the table is empty. |
| Append to table | --noreplace or --replace=false; if --[no]replace is unspecified, the default is append | WRITE_APPEND | (Default) Appends the data to the end of the table. |
| Overwrite table | --replace or --replace=true | WRITE_TRUNCATE | Erases all existing data in a table before writing the new data. This action also deletes the table schema, row level security, and removes any Cloud KMS key. |
If you load data into an existing table, the load job can append the data oroverwrite the table.
You can append or overwrite a table by using one of the following:
- The Google Cloud console
- The bq command-line tool's bq load command
- The jobs.insert API method and configuring a load job
- The client libraries

To append or overwrite a table with Parquet data:
Console
In the Google Cloud console, go to the BigQuery page.
- In the left pane, click Explorer.
- In the Explorer pane, expand your project, click Datasets, and then select a dataset.
- In the Dataset info section, click Create table.
- In the Create table pane, specify the following details:
  - In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
    - Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
    - For File format, select Parquet.
- In the Destination section, specify the following details:
  - For Dataset, select the dataset in which you want to create the table.
  - In the Table field, enter the name of the table that you want to create.
  - Verify that the Table type field is set to Native table.
- In the Schema section, no action is necessary. The schema is self-described in Parquet files. Note: It is possible to modify the table's schema when you append or overwrite it. For more information about supported schema changes during a load operation, see Modifying table schemas.
- Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables. You cannot convert a table to a partitioned or clustered table by appending or overwriting it. The Google Cloud console does not support appending to or overwriting partitioned or clustered tables in a load job.
- Click Advanced options and do the following:
  - For Write preference, choose Append to table or Overwrite table.
  - If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
  - For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
- Click Create table.
SQL
Use the LOAD DATA DDL statement. The following example appends a Parquet file to the table mytable:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:

LOAD DATA INTO mydataset.mytable
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://bucket/path/file.parquet']);

Click Run.
For more information about how to run queries, see Run an interactive query.
bq
Enter the bq load command with the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data. Supply the --source_format flag and set it to PARQUET. Because Parquet schemas are automatically retrieved from the self-describing source data, you don't need to provide a schema definition.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
--destination_kms_key: The Cloud KMS key for encryption of the table data.
bq --location=LOCATION load \
    --[no]replace \
    --source_format=FORMAT \
    DATASET.TABLE \
    PATH_TO_SOURCE
Replace the following:
- LOCATION: your location. The --location flag is optional. You can set a default value for the location by using the .bigqueryrc file.
- FORMAT: PARQUET.
- DATASET: an existing dataset.
- TABLE: the name of the table into which you're loading data.
- PATH_TO_SOURCE: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from gs://mybucket/mydata.parquet and overwrites a table named mytable in mydataset.

bq load \
    --replace \
    --source_format=PARQUET \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

The following command loads data from gs://mybucket/mydata.parquet and appends data to a table named mytable in mydataset.

bq load \
    --noreplace \
    --source_format=PARQUET \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

For information on appending and overwriting partitioned tables using the bq command-line tool, see Appending to and overwriting partitioned table data.
API
1. Create a load job that points to the source data in Cloud Storage.
2. (Optional) Specify your location in the location property in the jobReference section of the job resource.
3. The source URIs property must be fully qualified, in the format gs://BUCKET/OBJECT. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported.
4. Specify the data format by setting the configuration.load.sourceFormat property to PARQUET.
5. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE or WRITE_APPEND.
Go
Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importParquetTruncate demonstrates loading Apache Parquet data from Cloud Storage into a table
// and overwriting/truncating existing data in the table.
func importParquetTruncate(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.parquet")
	gcsRef.SourceFormat = bigquery.Parquet
	gcsRef.AutoDetect = true
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
	loader.WriteDisposition = bigquery.WriteTruncate

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.JobInfo.WriteDisposition;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.math.BigInteger;

public class LoadParquetReplaceTable {

  public static void runLoadParquetReplaceTable() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    loadParquetReplaceTable(datasetName);
  }

  public static void loadParquetReplaceTable(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Imports a GCS file into a table and overwrites table data if table already exists.
      // This sample loads the Parquet file at:
      // https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.parquet
      String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet";
      TableId tableId = TableId.of(datasetName, "us_states");

      // For more information on LoadJobConfiguration see:
      // https://googleapis.dev/java/google-cloud-clients/latest/com/google/cloud/bigquery/LoadJobConfiguration.Builder.html
      LoadJobConfiguration configuration =
          LoadJobConfiguration.builder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.parquet())
              // Set the write disposition to overwrite existing table data.
              .setWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
              .build();

      // For more information on Job see:
      // https://googleapis.dev/java/google-cloud-clients/latest/index.html?com/google/cloud/bigquery/package-summary.html
      // Load the table
      Job job = bigquery.create(JobInfo.of(configuration));

      // Load data from a GCS parquet file into the table
      // Blocks until this load table job completes its execution, either failing or succeeding.
      Job completedJob = job.waitFor();
      if (completedJob == null) {
        System.out.println("Job not executed since it no longer exists.");
        return;
      } else if (completedJob.getStatus().getError() != null) {
        System.out.println(
            "BigQuery was unable to load into the table due to an error: \n"
                + job.getStatus().getError());
        return;
      }

      // Check number of rows loaded into the table
      BigInteger numRows = bigquery.getTable(tableId).getNumRows();
      System.out.printf("Loaded %d rows. \n", numRows);

      System.out.println("GCS parquet overwrote existing table successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Table extraction job was interrupted. \n" + e.toString());
    }
  }
}

Node.js
Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the Parquet file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.parquet
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.parquet';

async function loadParquetFromGCSTruncate() {
  /**
   * Imports a GCS file into a table and overwrites
   * table data if table already exists.
   */

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = "my_dataset";
  // const tableId = "my_table";

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'PARQUET',
    // Set the write disposition to overwrite existing table data.
    writeDisposition: 'WRITE_TRUNCATE',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
PHP
Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';
// $tableId = 'The BigQuery table ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$table = $bigQuery->dataset($datasetId)->table($tableId);

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('PARQUET')->writeDisposition('WRITE_TRUNCATE');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
Python
Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries. To append rows to an existing table, set the LoadJobConfig.write_disposition property to WRITE_APPEND.
To replace the rows in an existing table, set the LoadJobConfig.write_disposition property to WRITE_TRUNCATE.

import io

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
)

body = io.BytesIO(b"Washington,WA")
client.load_table_from_file(body, table_id, job_config=job_config).result()
previous_rows = client.get_table(table_id).num_rows
assert previous_rows > 0

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.PARQUET,
)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))

Loading hive-partitioned Parquet data
BigQuery supports loading hive partitioned Parquet data stored on Cloud Storage and populates the hive partitioning columns as columns in the destination BigQuery managed table. For more information, see Loading externally partitioned data.
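A rough sketch with the Python client, assuming it exposes HivePartitioningOptions and LoadJobConfig.hive_partitioning (the bucket layout and table name are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table and hive-partitioned layout, for example:
# gs://mybucket/sales/dt=2024-01-01/region=us/000.parquet
table_id = "your-project.your_dataset.sales"

hive_config = bigquery.HivePartitioningOptions()
hive_config.mode = "AUTO"  # infer partition key types from the path
hive_config.source_uri_prefix = "gs://mybucket/sales/"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    hive_partitioning=hive_config,
)

load_job = client.load_table_from_uri(
    "gs://mybucket/sales/*.parquet", table_id, job_config=job_config
)
load_job.result()
```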
Parquet conversions
This section describes how BigQuery parses various data types when loading Parquet data.
Some Parquet data types (such as INT32, INT64, BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY) can be converted into multiple BigQuery data types. To ensure BigQuery converts the Parquet data types correctly, specify the appropriate data type in the Parquet file.
For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:
optional int32 date_col (DATE);
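For example, a minimal PyArrow sketch (the file name and values are placeholders) that writes an INT32-backed column annotated with the DATE logical type:

```python
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# date32() is stored as a 32-bit integer with the DATE logical type in Parquet.
schema = pa.schema([("date_col", pa.date32())])
table = pa.table(
    {"date_col": [datetime.date(2024, 1, 1), datetime.date(2024, 1, 2)]},
    schema=schema,
)
pq.write_table(table, "dates.parquet")
```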
BigQuery converts Parquet data types to the BigQuery data types that are described in the following sections.
Type conversions
| Parquet type | Parquet logical type | BigQuery data type |
|---|---|---|
| BOOLEAN | None | BOOLEAN |
| INT32 | None, INTEGER (UINT_8, UINT_16, UINT_32, INT_8, INT_16, INT_32) | INT64 |
| INT32 | DECIMAL | NUMERIC, BIGNUMERIC, or STRING |
| INT32 | DATE | DATE |
| INT64 | None, INTEGER (UINT_64, INT_64) | INT64 |
| INT64 | DECIMAL | NUMERIC, BIGNUMERIC, or STRING |
| INT64 | TIMESTAMP, precision=MILLIS (TIMESTAMP_MILLIS) | TIMESTAMP |
| INT64 | TIMESTAMP, precision=MICROS (TIMESTAMP_MICROS) | TIMESTAMP |
| INT96 | None | TIMESTAMP |
| FLOAT | None | FLOAT64 |
| DOUBLE | None | FLOAT64 |
| BYTE_ARRAY | None | BYTES |
| BYTE_ARRAY | STRING (UTF8) | STRING |
| FIXED_LEN_BYTE_ARRAY | DECIMAL | NUMERIC, BIGNUMERIC, or STRING |
| FIXED_LEN_BYTE_ARRAY | None | BYTES |
Nested groups are converted into STRUCT types. Other combinations of Parquet types and converted types are not supported.
Unsigned logical types
The Parquet UINT_8, UINT_16, UINT_32, and UINT_64 types are unsigned. BigQuery treats values with these types as unsigned when loading them into a BigQuery signed INTEGER column. For UINT_64, an error is returned if the unsigned value exceeds the maximum INTEGER value of 9,223,372,036,854,775,807.
Decimal logical type
Decimal logical types can be converted to NUMERIC, BIGNUMERIC, or STRING types. The converted type depends on the precision and scale parameters of the decimal logical type and the specified decimal target types. Specify the decimal target type as follows:
- For a load job using the jobs.insert API: use the JobConfigurationLoad.decimalTargetTypes field.
- For a load job using the bq load command in the bq command-line tool: use the --decimal_target_types flag.
- For a query against a table with external sources: use the ExternalDataConfiguration.decimalTargetTypes field.
- For a persistent external table created with DDL: use the decimal_target_types option.
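A minimal sketch with the Python client, assuming your google-cloud-bigquery version exposes LoadJobConfig.decimal_target_types (the table and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    # Try NUMERIC first, then BIGNUMERIC, then STRING, depending on the
    # precision and scale declared in the Parquet decimal logical type.
    decimal_target_types=["NUMERIC", "BIGNUMERIC", "STRING"],
)

load_job = client.load_table_from_uri(
    "gs://mybucket/decimals/*.parquet", table_id, job_config=job_config
)
load_job.result()
```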
Enum logical type
Enum logical types can be converted to STRING or BYTES. Specify the converted target type as follows:
- For a load job using the jobs.insert API: use the JobConfigurationLoad.parquetOptions field.
- For a load job using the bq load command in the bq command-line tool: use the --parquet_enum_as_string flag.
- For a persistent external table created with bq mk: use the --parquet_enum_as_string flag.
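A minimal sketch with the Python client, assuming it exposes ParquetOptions.enum_as_string (the table and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"  # placeholder

parquet_options = bigquery.ParquetOptions()
parquet_options.enum_as_string = True  # load Parquet ENUM values as STRING

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)

client.load_table_from_uri(
    "gs://mybucket/enums/*.parquet", table_id, job_config=job_config
).result()
```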
List logical type
You can enable schema inference for Parquet LIST logical types. BigQuery checks whether the LIST node is in the standard form or in one of the forms described by the backward-compatibility rules:

// standard form
<optional | required> group <name> (LIST) {
  repeated group list {
    <optional | required> <element-type> element;
  }
}

If yes, the corresponding field for the LIST node in the converted schema is treated as if the node has the following schema:

repeated <element-type> <name>

The nodes "list" and "element" are omitted.
Enable list inference as follows:
- For a load job using the jobs.insert API, use the JobConfigurationLoad.parquetOptions field.
- For a load job using the bq load command in the bq command-line tool, use the --parquet_enable_list_inference flag.
- For a persistent external table created with bq mk, use the --parquet_enable_list_inference flag.
- For a persistent external table created with the CREATE EXTERNAL TABLE statement, use the enable_list_inference option.
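Similarly, a minimal Python client sketch, assuming it exposes ParquetOptions.enable_list_inference (the table and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"  # placeholder

parquet_options = bigquery.ParquetOptions()
parquet_options.enable_list_inference = True  # flatten standard-form LIST nodes

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)

client.load_table_from_uri(
    "gs://mybucket/lists/*.parquet", table_id, job_config=job_config
).result()
```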
Geospatial data
You can load Parquet files that contain WKT, hex-encoded WKB, or GeoJSON in a STRING column, or WKB in a BYTE_ARRAY column by specifying a BigQuery schema with the type GEOGRAPHY. For more information, see Loading geospatial data.
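For example, a minimal Python client sketch that declares an explicit schema with a GEOGRAPHY column (the column names, table, and bucket are hypothetical; the Parquet file is assumed to store WKT strings in the geom column):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    # Declaring the WKT STRING column as GEOGRAPHY makes BigQuery parse it.
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("geom", "GEOGRAPHY"),
    ],
)

client.load_table_from_uri(
    "gs://mybucket/geo/*.parquet", table_id, job_config=job_config
).result()
```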
You can also load GeoParquet files. In this case, the columns described by the GeoParquet metadata are interpreted as type GEOGRAPHY by default. You can also load the raw WKB data into a BYTES column by providing an explicit schema. For more information, see Loading GeoParquet files.
Column name conversions
A column name can contain letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. If you use flexible column names, BigQuery supports starting a column name with a number. Exercise caution when starting columns with a number, since using flexible column names with the BigQuery Storage Read API or BigQuery Storage Write API requires special handling. For more information about flexible column name support, see flexible column names.
Column names have a maximum length of 300 characters. Column names can't use any of the following prefixes:
- _TABLE_
- _FILE_
- _PARTITION
- _ROW_TIMESTAMP
- __ROOT__
- _COLIDENTIFIER
- _CHANGE_SEQUENCE_NUMBER
- _CHANGE_TYPE
- _CHANGE_TIMESTAMP
Duplicate column names are not allowed even if the case differs. For example, a column named Column1 is considered identical to a column named column1. To learn more about column naming rules, see Column names in the GoogleSQL reference.
If a table name (for example, test) is the same as one of its column names (for example, test), the SELECT expression interprets the test column as a STRUCT containing all other table columns. To avoid this collision, use one of the following methods:
- Avoid using the same name for a table and its columns.
- Avoid using _field_ as a column name prefix. System-reserved prefixes cause automatic renaming during queries. For example, the SELECT _field_ FROM project1.dataset.test query returns a column named _field_1. If you must query a column with this name, use an alias to control the output.
- Assign the table a different alias. For example, the following query assigns a table alias t to the table project1.dataset.test:

  SELECT test FROM project1.dataset.test AS t;

- Include the table name when referencing a column. For example:

  SELECT test.test FROM project1.dataset.test;
Flexible column names
You have more flexibility in what you name columns, including expanded access to characters in languages other than English as well as additional symbols. Make sure to use backtick (`) characters to enclose flexible column names if they are Quoted Identifiers.
Flexible column names support the following characters:
- Any letter in any language, as represented by the Unicode regular expression \p{L}.
- Any numeric character in any language as represented by the Unicode regular expression \p{N}.
- Any connector punctuation character, including underscores, as represented by the Unicode regular expression \p{Pc}.
- A hyphen or dash as represented by the Unicode regular expression \p{Pd}.
- Any mark intended to accompany another character as represented by the Unicode regular expression \p{M}. For example, accents, umlauts, or enclosing boxes.
- The following special characters:
  - An ampersand (&) as represented by the Unicode regular expression \u0026.
  - A percent sign (%) as represented by the Unicode regular expression \u0025.
  - An equals sign (=) as represented by the Unicode regular expression \u003D.
  - A plus sign (+) as represented by the Unicode regular expression \u002B.
  - A colon (:) as represented by the Unicode regular expression \u003A.
  - An apostrophe (') as represented by the Unicode regular expression \u0027.
  - A less-than sign (<) as represented by the Unicode regular expression \u003C.
  - A greater-than sign (>) as represented by the Unicode regular expression \u003E.
  - A number sign (#) as represented by the Unicode regular expression \u0023.
  - A vertical line (|) as represented by the Unicode regular expression \u007c.
  - Whitespace.
Flexible column names don't support the following special characters:
- An exclamation mark (!) as represented by the Unicode regular expression \u0021.
- A quotation mark (") as represented by the Unicode regular expression \u0022.
- A dollar sign ($) as represented by the Unicode regular expression \u0024.
- A left parenthesis (() as represented by the Unicode regular expression \u0028.
- A right parenthesis ()) as represented by the Unicode regular expression \u0029.
- An asterisk (*) as represented by the Unicode regular expression \u002A.
- A comma (,) as represented by the Unicode regular expression \u002C.
- A period (.) as represented by the Unicode regular expression \u002E. Periods are not replaced by underscores in Parquet file column names when a column name character map is used. For more information, see flexible column limitations.
- A slash (/) as represented by the Unicode regular expression \u002F.
- A semicolon (;) as represented by the Unicode regular expression \u003B.
- A question mark (?) as represented by the Unicode regular expression \u003F.
- An at sign (@) as represented by the Unicode regular expression \u0040.
- A left square bracket ([) as represented by the Unicode regular expression \u005B.
- A backslash (\) as represented by the Unicode regular expression \u005C.
- A right square bracket (]) as represented by the Unicode regular expression \u005D.
- A circumflex accent (^) as represented by the Unicode regular expression \u005E.
- A grave accent (`) as represented by the Unicode regular expression \u0060.
- A left curly bracket ({) as represented by the Unicode regular expression \u007B.
- A right curly bracket (}) as represented by the Unicode regular expression \u007D.
- A tilde (~) as represented by the Unicode regular expression \u007E.
For additional guidelines, see Column names.
The expanded column characters are supported by both the BigQuery Storage Read API and the BigQuery Storage Write API. To use the expanded list of Unicode characters with the BigQuery Storage Read API, you must set a flag. You can use the displayName attribute to retrieve the column name. The following example shows how to set a flag with the Python client:
from google.cloud.bigquery_storage import types

requested_session = types.ReadSession()

# Set Avro serialization options for flexible columns.
options = types.AvroSerializationOptions()
options.enable_display_name_attribute = True
requested_session.read_options.avro_serialization_options = options

To use the expanded list of Unicode characters with the BigQuery Storage Write API, you must provide the schema with column_name notation, unless you are using the JsonStreamWriter writer object. The following example shows how to provide the schema:

syntax = "proto2";

package mypackage;

// Source protos located in github.com/googleapis/googleapis
import "google/cloud/bigquery/storage/v1/annotations.proto";

message FlexibleSchema {
  optional string item_name_column = 1
      [(.google.cloud.bigquery.storage.v1.column_name) = "name-列"];
  optional string item_description_column = 2
      [(.google.cloud.bigquery.storage.v1.column_name) = "description-列"];
}

In this example, item_name_column and item_description_column are placeholder names which need to be compliant with the protocol buffer naming convention. Note that column_name annotations always take precedence over placeholder names.
Limitations
Flexible column names are not supported with external tables.
You cannot load Parquet files containing columns that have a period (.) in the column name.
Column names from Parquet files are treated as case-insensitive when loaded into BigQuery. Identical case-insensitive names will cause collisions. To avoid this, either append an underscore to one of the duplicate column names or rename the columns before loading.
Debugging your Parquet file
If your load jobs fail with data errors, you can use PyArrow to verify whether your Parquet data files are corrupted. If PyArrow fails to read the files, the files are likely to be rejected by the BigQuery load job. The following example shows how to read the contents of a Parquet file by using PyArrow:

from pyarrow import parquet as pq

# Read the entire file
pq.read_table('your_sample_file.parquet')
# Read specific columns
pq.read_table('your_sample_file.parquet', columns=['some_column', 'another_column'])
# Read the metadata of specific columns
file_metadata = pq.read_metadata('your_sample_file.parquet')
for col in file_metadata.row_group(0).to_dict()['columns']:
    print(col['column_path_in_schema'])
    print(col['num_values'])

For more information, see the PyArrow docs.