Loading Parquet data from Cloud Storage
This page provides an overview of loading Parquet data from Cloud Storage into BigQuery.
Parquet is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem.
When you load Parquet data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
For information about loading Parquet data from a local file, see Loading data from local files.
Limitations
You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:
- BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
- BigQuery doesn't support Cloud Storage object versioning. If you include a generation number in the Cloud Storage URI, then the load job fails.
- You can't use a wildcard in the Cloud Storage URI if any of the files to be loaded have different schemas. Any difference in the position of columns qualifies as a different schema.
Input file requirements
To avoid resourcesExceeded errors when loading Parquet files into BigQuery, follow these guidelines:
- Keep row sizes to 50 MB or less.
- If your input data contains more than 100 columns, consider reducing the page size to be smaller than the default page size (1 * 1024 * 1024 bytes). This is especially helpful if you are using significant compression.
- For optimal performance, aim for row group sizes of at least 16 MiB. Smaller row group sizes increase I/O and slow down loads and queries.
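If you generate your Parquet files with PyArrow, a minimal sketch of tuning both knobs might look like the following (the file name and data are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Example table; replace with your own data.
table = pa.table({"id": list(range(1000)), "name": ["row"] * 1000})

# Write with a smaller data page size (default is 1 MiB) and a row group
# size large enough to keep row groups at roughly 16 MiB or more.
pq.write_table(
    table,
    "mydata.parquet",
    data_page_size=256 * 1024,  # 256 KiB pages for wide, highly compressed data
    row_group_size=500_000,     # rows per row group; tune so groups are >= 16 MiB
)
```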
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document, and create a dataset to store your data.
Required permissions
To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.
Permissions to load data into BigQuery
To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:
- bigquery.tables.create
- bigquery.tables.updateData
- bigquery.tables.update
- bigquery.jobs.create
Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:
- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.admin (includes the bigquery.jobs.create permission)
- bigquery.user (includes the bigquery.jobs.create permission)
- bigquery.jobUser (includes the bigquery.jobs.create permission)
Additionally, if you have the bigquery.datasets.create permission, you can create and update tables using a load job in the datasets that you create.
For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Permissions to load data from Cloud Storage
To get the permissions that you need to load data from a Cloud Storage bucket, ask your administrator to grant you the Storage Admin (roles/storage.admin) IAM role on the bucket. For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to load data from a Cloud Storage bucket:
- storage.buckets.get
- storage.objects.get
- storage.objects.list (required if you are using a URI wildcard)
You might also be able to get these permissions with custom roles or other predefined roles.
Create a dataset
Create a BigQuery dataset to store your data.
Parquet schemas
When you load Parquet files into BigQuery, the table schema is automatically retrieved from the self-describing source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.
For example, suppose you have the following Parquet files in Cloud Storage:

gs://mybucket/00/
  a.parquet
  z.parquet
gs://mybucket/01/
  b.parquet

Running this command in the bq command-line tool loads all of the files (as a comma-separated list), and the schema is derived from mybucket/01/b.parquet:

bq load \
    --source_format=PARQUET \
    dataset.table \
    "gs://mybucket/00/*.parquet","gs://mybucket/01/*.parquet"
When you load multiple Parquet files that have different schemas, identical columns specified in multiple schemas must have the same mode in each schema definition.
When BigQuery detects the schema, some Parquet data types are converted to BigQuery data types to make them compatible with GoogleSQL syntax. For more information, see Parquet conversions.
To provide a table schema for creating external tables, set the referenceFileSchemaUri property in the BigQuery API or the --reference_file_schema_uri parameter in the bq command-line tool to the URL of the reference file. For example, --reference_file_schema_uri="gs://mybucket/schema.parquet".
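As a rough sketch with the Python client library, assuming your google-cloud-bigquery version exposes ExternalConfig.reference_file_schema_uri (the table and bucket names below are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical external table and bucket names for illustration.
table_id = "your-project.your_dataset.your_external_table"

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://mybucket/data/*.parquet"]
# Point at a single Parquet file whose schema should be used for the table.
external_config.reference_file_schema_uri = "gs://mybucket/schema.parquet"

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
table = client.create_table(table)  # Makes an API request.
```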
Parquet compression
BigQuery supports the following compression codecs for Parquet file contents:
- GZip
- LZO_1C
- LZO_1X
- LZ4_RAW
- Snappy
- ZSTD
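For example, if you produce your Parquet files with PyArrow, you can choose one of these codecs at write time (the file names and data are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"state": ["WA", "OR"], "population": [7_705_281, 4_237_256]})

# Snappy is the PyArrow default; ZSTD usually compresses better at some CPU cost.
pq.write_table(table, "us_states_snappy.parquet", compression="snappy")
pq.write_table(table, "us_states_zstd.parquet", compression="zstd")
```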
Loading Parquet data into a new table
You can load Parquet data into a new table by using one of the following:
- The Google Cloud console
- The bq command-line tool's bq load command
- The jobs.insert API method and configuring a load job
- The client libraries

To load Parquet data from Cloud Storage into a new BigQuery table:
Console
In the Google Cloud console, go to the BigQuery page.
- In the left pane, click Explorer.
- In the Explorer pane, expand your project, click Datasets, and then select a dataset.
- In the Dataset info section, click Create table.
- In the Create table pane, specify the following details:
  - In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
    - Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
    - For File format, select Parquet.
- In the Destination section, specify the following details:
  - For Dataset, select the dataset in which you want to create the table.
  - In the Table field, enter the name of the table that you want to create.
  - Verify that the Table type field is set to Native table.
- In the Schema section, no action is necessary. The schema is self-described in Parquet files.
- Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables.
- Click Advanced options and do the following:
  - For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
  - If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
  - For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
- Click Create table.

After the table is created, you can update the table's expiration, description, and labels, but you cannot add a partition expiration after a table is created using the Google Cloud console. For more information, see Managing tables.
SQL
Use the LOAD DATA DDL statement. The following example loads a Parquet file into the new table mytable:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:

LOAD DATA OVERWRITE mydataset.mytable
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://bucket/path/file.parquet']);

Click Run.
For more information about how to run queries, see Run an interactive query.
bq
Use the bq load command, specify PARQUET using the --source_format flag, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
- --time_partitioning_type: Enables time-based partitioning on a table and sets the partition type. Possible values are HOUR, DAY, MONTH, and YEAR. This flag is optional when you create a table partitioned on a DATE, DATETIME, or TIMESTAMP column. The default partition type for time-based partitioning is DAY. You cannot change the partitioning specification on an existing table.
- --time_partitioning_expiration: An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.
- --time_partitioning_field: The DATE or TIMESTAMP column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.
- --require_partition_filter: When enabled, this option requires users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter can reduce cost and improve performance. For more information, see Require a partition filter in queries.
- --clustering_fields: A comma-separated list of up to four column names used to create a clustered table.
- --destination_kms_key: The Cloud KMS key for encryption of the table data.
- --column_name_character_map: Defines the scope and handling of characters in column names, with the option of enabling flexible column names. For more information, see load_option_list. For more information on supported and unsupported characters, see flexible column names.

For more information, see the documentation on partitioned tables, clustered tables, and table encryption.
To load Parquet data into BigQuery, enter the following command:
bq --location=LOCATION load \
    --source_format=FORMAT \
    DATASET.TABLE \
    PATH_TO_SOURCE
Replace the following:
- LOCATION: your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
- FORMAT: PARQUET.
- DATASET: an existing dataset.
- TABLE: the name of the table into which you're loading data.
- PATH_TO_SOURCE: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from gs://mybucket/mydata.parquet into a table named mytable in mydataset.

bq load \
    --source_format=PARQUET \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

The following command loads data from gs://mybucket/mydata.parquet into a new ingestion-time partitioned table named mytable in mydataset.

bq load \
    --source_format=PARQUET \
    --time_partitioning_type=DAY \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

The following command loads data from gs://mybucket/mydata.parquet into a partitioned table named mytable in mydataset. The table is partitioned on the mytimestamp column.

bq load \
    --source_format=PARQUET \
    --time_partitioning_field mytimestamp \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard.

bq load \
    --source_format=PARQUET \
    mydataset.mytable \
    gs://mybucket/mydata*.parquet

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards.

bq load \
    --source_format=PARQUET \
    mydataset.mytable \
    "gs://mybucket/00/*.parquet","gs://mybucket/01/*.parquet"

API
1. Create a load job that points to the source data in Cloud Storage.
2. (Optional) Specify your location in the location property in the jobReference section of the job resource.
3. The source URIs property must be fully qualified, in the format gs://BUCKET/OBJECT. Each URI can contain one '*' wildcard character.
4. Specify the Parquet data format by setting the sourceFormat property to PARQUET.
5. To check the job status, call jobs.get(JOB_ID*), replacing JOB_ID with the ID of the job returned by the initial request.
   - If status.state = DONE, the job completed successfully.
   - If the status.errorResult property is present, the request failed, and that object includes information describing what went wrong. When a request fails, no table is created and no data is loaded.
   - If status.errorResult is absent, the job finished successfully, although there might have been some nonfatal errors, such as problems importing a few rows. Nonfatal errors are listed in the returned job object's status.errors property.
API notes:
- Load jobs are atomic and consistent: if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.
- As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.
- Calling jobs.insert on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
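As a sketch of this practice with the Python client (the table ID and URI are placeholders), you can generate the job ID yourself and pass it to the load call so that a retry reuses the same job:

```python
import uuid

from google.cloud import bigquery

client = bigquery.Client()

table_id = "your-project.your_dataset.your_table_name"  # placeholder
uri = "gs://mybucket/mydata.parquet"  # placeholder

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# Generate the job ID once; reusing it on retry makes the insert idempotent.
job_id = f"load_parquet_{uuid.uuid4()}"

load_job = client.load_table_from_uri(
    uri, table_id, job_id=job_id, job_config=job_config
)
load_job.result()  # Waits for the job to complete.
```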
Go
Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importParquet demonstrates loading Apache Parquet data from Cloud Storage into a table.
func importParquet(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.parquet")
	gcsRef.SourceFormat = bigquery.Parquet
	gcsRef.AutoDetect = true
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.math.BigInteger;

public class LoadParquet {

  public static void runLoadParquet() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    loadParquet(datasetName);
  }

  public static void loadParquet(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet";
      TableId tableId = TableId.of(datasetName, "us_states");

      LoadJobConfiguration configuration =
          LoadJobConfiguration.builder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.parquet())
              .build();

      // For more information on Job see:
      // https://googleapis.dev/java/google-cloud-clients/latest/index.html?com/google/cloud/bigquery/package-summary.html
      // Load the table
      Job job = bigquery.create(JobInfo.of(configuration));

      // Blocks until this load table job completes its execution, either failing or succeeding.
      Job completedJob = job.waitFor();
      if (completedJob == null) {
        System.out.println("Job not executed since it no longer exists.");
        return;
      } else if (completedJob.getStatus().getError() != null) {
        System.out.println(
            "BigQuery was unable to load the table due to an error: \n" + job.getStatus().getError());
        return;
      }

      // Check number of rows loaded into the table
      BigInteger numRows = bigquery.getTable(tableId).getNumRows();
      System.out.printf("Loaded %d rows. \n", numRows);

      System.out.println("GCS parquet loaded successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("GCS Parquet was not loaded. \n" + e.toString());
    }
  }
}
Node.js
Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the Parquet file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.parquet
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.parquet';

async function loadTableGCSParquet() {
  // Imports a GCS file into a table with Parquet source format.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'PARQUET',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
PHP
Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('PARQUET');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
Python
Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries. Set the LoadJobConfig.source_format property to PARQUET and pass the job config as the job_config argument to the load_table_from_uri() method.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
Appending to or overwriting a table with Parquet data
You can load additional data into a table either from source files or by appending query results.
In the Google Cloud console, use the Write preference option to specify what action to take when you load data from a source file or from a query result.
You have the following options when you load additional data into a table:
| Console option | bq tool flag | BigQuery API property | Description |
|---|---|---|---|
| Write if empty | Not supported | WRITE_EMPTY | Writes the data only if the table is empty. |
| Append to table | --noreplace or --replace=false; if --[no]replace is unspecified, the default is append | WRITE_APPEND | (Default) Appends the data to the end of the table. |
| Overwrite table | --replace or --replace=true | WRITE_TRUNCATE | Erases all existing data in a table before writing the new data. This action also deletes the table schema, row level security, and removes any Cloud KMS key. |
If you load data into an existing table, the load job can append the data oroverwrite the table.
You can append or overwrite a table by using one of the following:
- The Google Cloud console
- The bq command-line tool's bq load command
- The jobs.insert API method and configuring a load job
- The client libraries

To append or overwrite a table with Parquet data:
Console
In the Google Cloud console, go to the BigQuery page.
- In the left pane, click Explorer.
- In the Explorer pane, expand your project, click Datasets, and then select a dataset.
- In the Dataset info section, click Create table.
- In the Create table pane, specify the following details:
  - In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
    - Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
    - For File format, select Parquet.
- In the Destination section, specify the following details:
  - For Dataset, select the dataset in which you want to create the table.
  - In the Table field, enter the name of the table that you want to create.
  - Verify that the Table type field is set to Native table.
- In the Schema section, no action is necessary. The schema is self-described in Parquet files. Note: It is possible to modify the table's schema when you append or overwrite it. For more information about supported schema changes during a load operation, see Modifying table schemas.
- Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables. You cannot convert a table to a partitioned or clustered table by appending or overwriting it. The Google Cloud console does not support appending to or overwriting partitioned or clustered tables in a load job.
- Click Advanced options and do the following:
  - For Write preference, choose Append to table or Overwrite table.
  - If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
  - For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
- Click Create table.
SQL
Use the LOAD DATA DDL statement. The following example appends a Parquet file to the table mytable:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:

LOAD DATA INTO mydataset.mytable
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://bucket/path/file.parquet']);

Click Run.
For more information about how to run queries, see Run an interactive query.
bq
Enter the bq load command with the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data. Supply the --source_format flag and set it to PARQUET. Because Parquet schemas are automatically retrieved from the self-describing source data, you don't need to provide a schema definition.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
--destination_kms_key: The Cloud KMS key for encryption of the table data.
bq --location=LOCATION load \
    --[no]replace \
    --source_format=FORMAT \
    DATASET.TABLE \
    PATH_TO_SOURCE
Replace the following:
- LOCATION: your location. The --location flag is optional. You can set a default value for the location by using the .bigqueryrc file.
- FORMAT: PARQUET.
- DATASET: an existing dataset.
- TABLE: the name of the table into which you're loading data.
- PATH_TO_SOURCE: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from gs://mybucket/mydata.parquet and overwrites a table named mytable in mydataset.

bq load \
    --replace \
    --source_format=PARQUET \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

The following command loads data from gs://mybucket/mydata.parquet and appends data to a table named mytable in mydataset.

bq load \
    --noreplace \
    --source_format=PARQUET \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

For information on appending and overwriting partitioned tables using the bq command-line tool, see Appending to and overwriting partitioned table data.
API
1. Create a load job that points to the source data in Cloud Storage.
2. (Optional) Specify your location in the location property in the jobReference section of the job resource.
3. The source URIs property must be fully qualified, in the format gs://BUCKET/OBJECT. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported.
4. Specify the data format by setting the configuration.load.sourceFormat property to PARQUET.
5. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE or WRITE_APPEND.
Go
Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importParquetTruncate demonstrates loading Apache Parquet data from Cloud Storage into a table
// and overwriting/truncating existing data in the table.
func importParquetTruncate(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.parquet")
	gcsRef.SourceFormat = bigquery.Parquet
	gcsRef.AutoDetect = true
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
	loader.WriteDisposition = bigquery.WriteTruncate

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.JobInfo.WriteDisposition;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.math.BigInteger;

public class LoadParquetReplaceTable {

  public static void runLoadParquetReplaceTable() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    loadParquetReplaceTable(datasetName);
  }

  public static void loadParquetReplaceTable(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Imports a GCS file into a table and overwrites table data if table already exists.
      // This sample loads the Parquet file at:
      // https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.parquet
      String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet";
      TableId tableId = TableId.of(datasetName, "us_states");

      // For more information on LoadJobConfiguration see:
      // https://googleapis.dev/java/google-cloud-clients/latest/com/google/cloud/bigquery/LoadJobConfiguration.Builder.html
      LoadJobConfiguration configuration =
          LoadJobConfiguration.builder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.parquet())
              // Set the write disposition to overwrite existing table data.
              .setWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
              .build();

      // For more information on Job see:
      // https://googleapis.dev/java/google-cloud-clients/latest/index.html?com/google/cloud/bigquery/package-summary.html
      // Load the table
      Job job = bigquery.create(JobInfo.of(configuration));

      // Load data from a GCS parquet file into the table
      // Blocks until this load table job completes its execution, either failing or succeeding.
      Job completedJob = job.waitFor();
      if (completedJob == null) {
        System.out.println("Job not executed since it no longer exists.");
        return;
      } else if (completedJob.getStatus().getError() != null) {
        System.out.println(
            "BigQuery was unable to load into the table due to an error: \n"
                + job.getStatus().getError());
        return;
      }

      // Check number of rows loaded into the table
      BigInteger numRows = bigquery.getTable(tableId).getNumRows();
      System.out.printf("Loaded %d rows. \n", numRows);

      System.out.println("GCS parquet overwrote existing table successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Table extraction job was interrupted. \n" + e.toString());
    }
  }
}

Node.js
Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the Parquet file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.parquet
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.parquet';

async function loadParquetFromGCSTruncate() {
  /**
   * Imports a GCS file into a table and overwrites
   * table data if table already exists.
   */

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = "my_dataset";
  // const tableId = "my_table";

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'PARQUET',
    // Set the write disposition to overwrite existing table data.
    writeDisposition: 'WRITE_TRUNCATE',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
PHP
Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';
// $tableId = 'The BigQuery table ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$table = $bigQuery->dataset($datasetId)->table($tableId);

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('PARQUET')->writeDisposition('WRITE_TRUNCATE');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
Python
Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation. To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries. To append rows to an existing table, set the LoadJobConfig.write_disposition property to WRITE_APPEND.
To replace the rows in an existing table, set the LoadJobConfig.write_disposition property to WRITE_TRUNCATE.

import io

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
)

body = io.BytesIO(b"Washington,WA")
client.load_table_from_file(body, table_id, job_config=job_config).result()
previous_rows = client.get_table(table_id).num_rows
assert previous_rows > 0

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.PARQUET,
)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))

Loading hive-partitioned Parquet data
BigQuery supports loading hive partitioned Parquet data stored on Cloud Storage and populates the hive partitioning columns as columns in the destination BigQuery managed table. For more information, see Loading externally partitioned data.
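A rough sketch with the Python client, assuming it exposes HivePartitioningOptions and LoadJobConfig.hive_partitioning (the bucket layout and table name are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table and hive-partitioned layout, for example:
# gs://mybucket/sales/dt=2024-01-01/region=us/000.parquet
table_id = "your-project.your_dataset.sales"

hive_config = bigquery.HivePartitioningOptions()
hive_config.mode = "AUTO"  # infer partition key types from the path
hive_config.source_uri_prefix = "gs://mybucket/sales/"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    hive_partitioning=hive_config,
)

load_job = client.load_table_from_uri(
    "gs://mybucket/sales/*.parquet", table_id, job_config=job_config
)
load_job.result()
```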
Parquet conversions
This section describes how BigQuery parses various data types when loading Parquet data.
Some Parquet data types (such as INT32, INT64, BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY) can be converted into multiple BigQuery data types. To ensure BigQuery converts the Parquet data types correctly, specify the appropriate data type in the Parquet file.
For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:
optional int32 date_col (DATE);
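For example, a minimal PyArrow sketch (the file name and values are placeholders) that writes an INT32-backed column annotated with the DATE logical type:

```python
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# date32() is stored as a 32-bit integer with the DATE logical type in Parquet.
schema = pa.schema([("date_col", pa.date32())])
table = pa.table(
    {"date_col": [datetime.date(2024, 1, 1), datetime.date(2024, 1, 2)]},
    schema=schema,
)
pq.write_table(table, "dates.parquet")
```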
BigQuery converts Parquet data types to the BigQuery data types that are described in the following sections.
Type conversions
| Parquet type | Parquet logical type | BigQuery data type |
|---|---|---|
| BOOLEAN | None | BOOLEAN |
| INT32 | None, INTEGER (UINT_8, UINT_16, UINT_32, INT_8, INT_16, INT_32) | INT64 |
| INT32 | DECIMAL | NUMERIC, BIGNUMERIC, or STRING |
| INT32 | DATE | DATE |
| INT64 | None, INTEGER (UINT_64, INT_64) | INT64 |
| INT64 | DECIMAL | NUMERIC, BIGNUMERIC, or STRING |
| INT64 | TIMESTAMP, precision=MILLIS (TIMESTAMP_MILLIS) | TIMESTAMP |
| INT64 | TIMESTAMP, precision=MICROS (TIMESTAMP_MICROS) | TIMESTAMP |
| INT96 | None | TIMESTAMP |
| FLOAT | None | FLOAT64 |
| DOUBLE | None | FLOAT64 |
| BYTE_ARRAY | None | BYTES |
| BYTE_ARRAY | STRING (UTF8) | STRING |
| FIXED_LEN_BYTE_ARRAY | DECIMAL | NUMERIC, BIGNUMERIC, or STRING |
| FIXED_LEN_BYTE_ARRAY | None | BYTES |
Nested groups are converted into STRUCT types. Other combinations of Parquet types and converted types are not supported.
Unsigned logical types
The Parquet UINT_8, UINT_16, UINT_32, and UINT_64 types are unsigned. BigQuery treats values with these types as unsigned when loading them into a BigQuery signed INTEGER column. For UINT_64, an error is returned if the unsigned value exceeds the maximum INTEGER value of 9,223,372,036,854,775,807.
Decimal logical type
Decimal logical types can be converted to NUMERIC, BIGNUMERIC, or STRING types. The converted type depends on the precision and scale parameters of the decimal logical type and the specified decimal target types. Specify the decimal target type as follows:
- For a load job using the jobs.insert API: use the JobConfigurationLoad.decimalTargetTypes field.
- For a load job using the bq load command in the bq command-line tool: use the --decimal_target_types flag.
- For a query against a table with external sources: use the ExternalDataConfiguration.decimalTargetTypes field.
- For a persistent external table created with DDL: use the decimal_target_types option.
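A minimal sketch with the Python client, assuming your google-cloud-bigquery version exposes LoadJobConfig.decimal_target_types (the table and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    # Try NUMERIC first, then BIGNUMERIC, then STRING, depending on the
    # precision and scale declared in the Parquet decimal logical type.
    decimal_target_types=["NUMERIC", "BIGNUMERIC", "STRING"],
)

load_job = client.load_table_from_uri(
    "gs://mybucket/decimals/*.parquet", table_id, job_config=job_config
)
load_job.result()
```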
Enum logical type
Enum logical types can be converted to STRING or BYTES. Specify the converted target type as follows:
- For a load job using the jobs.insert API: use the JobConfigurationLoad.parquetOptions field.
- For a load job using the bq load command in the bq command-line tool: use the --parquet_enum_as_string flag.
- For a persistent external table created with bq mk: use the --parquet_enum_as_string flag.
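A minimal sketch with the Python client, assuming it exposes ParquetOptions.enum_as_string (the table and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"  # placeholder

parquet_options = bigquery.ParquetOptions()
parquet_options.enum_as_string = True  # load Parquet ENUM values as STRING

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)

client.load_table_from_uri(
    "gs://mybucket/enums/*.parquet", table_id, job_config=job_config
).result()
```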
List logical type
You can enable schema inference for Parquet LIST logical types. BigQuery checks whether the LIST node is in the standard form or in one of the forms described by the backward-compatibility rules:

// standard form
<optional | required> group <name> (LIST) {
  repeated group list {
    <optional | required> <element-type> element;
  }
}

If yes, the corresponding field for the LIST node in the converted schema is treated as if the node has the following schema:

repeated <element-type> <name>

The nodes "list" and "element" are omitted.
Enable list inference as follows:
- For a load job using the jobs.insert API, use the JobConfigurationLoad.parquetOptions field.
- For a load job using the bq load command in the bq command-line tool, use the --parquet_enable_list_inference flag.
- For a persistent external table created with bq mk, use the --parquet_enable_list_inference flag.
- For a persistent external table created with the CREATE EXTERNAL TABLE statement, use the enable_list_inference option.
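Similarly, a minimal Python client sketch, assuming it exposes ParquetOptions.enable_list_inference (the table and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"  # placeholder

parquet_options = bigquery.ParquetOptions()
parquet_options.enable_list_inference = True  # flatten standard-form LIST nodes

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)

client.load_table_from_uri(
    "gs://mybucket/lists/*.parquet", table_id, job_config=job_config
).result()
```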
Geospatial data
You can load Parquet files that contain WKT, hex-encoded WKB, or GeoJSON in a STRING column, or WKB in a BYTE_ARRAY column by specifying a BigQuery schema with the type GEOGRAPHY. For more information, see Loading geospatial data.
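For example, a minimal Python client sketch that declares an explicit schema with a GEOGRAPHY column (the column names, table, and bucket are hypothetical; the Parquet file is assumed to store WKT strings in the geom column):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    # Declaring the WKT STRING column as GEOGRAPHY makes BigQuery parse it.
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("geom", "GEOGRAPHY"),
    ],
)

client.load_table_from_uri(
    "gs://mybucket/geo/*.parquet", table_id, job_config=job_config
).result()
```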
You can also load GeoParquet files. In this case, the columns described by the GeoParquet metadata are interpreted as type GEOGRAPHY by default. You can also load the raw WKB data into a BYTES column by providing an explicit schema. For more information, see Loading GeoParquet files.
Column name conversions
A column name can contain letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. If you use flexible column names, BigQuery supports starting a column name with a number. Exercise caution when starting columns with a number, since using flexible column names with the BigQuery Storage Read API or BigQuery Storage Write API requires special handling. For more information about flexible column name support, see flexible column names.
Column names have a maximum length of 300 characters. Column names can't use any of the following prefixes:
- _TABLE_
- _FILE_
- _PARTITION
- _ROW_TIMESTAMP
- __ROOT__
- _COLIDENTIFIER
- _CHANGE_SEQUENCE_NUMBER
- _CHANGE_TYPE
- _CHANGE_TIMESTAMP
Duplicate column names are not allowed even if the case differs. For example, a column named Column1 is considered identical to a column named column1. To learn more about column naming rules, see Column names in the GoogleSQL reference.
If a table name (for example, test) is the same as one of its column names (for example, test), the SELECT expression interprets the test column as a STRUCT containing all other table columns. To avoid this collision, use one of the following methods:
- Avoid using the same name for a table and its columns.
- Avoid using _field_ as a column name prefix. System-reserved prefixes cause automatic renaming during queries. For example, the SELECT _field_ FROM project1.dataset.test query returns a column named _field_1. If you must query a column with this name, use an alias to control the output.
- Assign the table a different alias. For example, the following query assigns a table alias t to the table project1.dataset.test:

  SELECT test FROM project1.dataset.test AS t;

- Include the table name when referencing a column. For example:

  SELECT test.test FROM project1.dataset.test;
Flexible column names
You have more flexibility in what you name columns, including expanded access to characters in languages other than English as well as additional symbols. Make sure to use backtick (`) characters to enclose flexible column names if they are Quoted Identifiers.
Flexible column names support the following characters:
- Any letter in any language, as represented by the Unicode regular expression \p{L}.
- Any numeric character in any language as represented by the Unicode regular expression \p{N}.
- Any connector punctuation character, including underscores, as represented by the Unicode regular expression \p{Pc}.
- A hyphen or dash as represented by the Unicode regular expression \p{Pd}.
- Any mark intended to accompany another character as represented by the Unicode regular expression \p{M}. For example, accents, umlauts, or enclosing boxes.
- The following special characters:
  - An ampersand (&) as represented by the Unicode regular expression \u0026.
  - A percent sign (%) as represented by the Unicode regular expression \u0025.
  - An equals sign (=) as represented by the Unicode regular expression \u003D.
  - A plus sign (+) as represented by the Unicode regular expression \u002B.
  - A colon (:) as represented by the Unicode regular expression \u003A.
  - An apostrophe (') as represented by the Unicode regular expression \u0027.
  - A less-than sign (<) as represented by the Unicode regular expression \u003C.
  - A greater-than sign (>) as represented by the Unicode regular expression \u003E.
  - A number sign (#) as represented by the Unicode regular expression \u0023.
  - A vertical line (|) as represented by the Unicode regular expression \u007c.
  - Whitespace.
Flexible column names don't support the following special characters:
- An exclamation mark (!) as represented by the Unicode regular expression \u0021.
- A quotation mark (") as represented by the Unicode regular expression \u0022.
- A dollar sign ($) as represented by the Unicode regular expression \u0024.
- A left parenthesis (() as represented by the Unicode regular expression \u0028.
- A right parenthesis ()) as represented by the Unicode regular expression \u0029.
- An asterisk (*) as represented by the Unicode regular expression \u002A.
- A comma (,) as represented by the Unicode regular expression \u002C.
- A period (.) as represented by the Unicode regular expression \u002E. Periods are not replaced by underscores in Parquet file column names when a column name character map is used. For more information, see flexible column limitations.
- A slash (/) as represented by the Unicode regular expression \u002F.
- A semicolon (;) as represented by the Unicode regular expression \u003B.
- A question mark (?) as represented by the Unicode regular expression \u003F.
- An at sign (@) as represented by the Unicode regular expression \u0040.
- A left square bracket ([) as represented by the Unicode regular expression \u005B.
- A backslash (\) as represented by the Unicode regular expression \u005C.
- A right square bracket (]) as represented by the Unicode regular expression \u005D.
- A circumflex accent (^) as represented by the Unicode regular expression \u005E.
- A grave accent (`) as represented by the Unicode regular expression \u0060.
- A left curly bracket ({) as represented by the Unicode regular expression \u007B.
- A right curly bracket (}) as represented by the Unicode regular expression \u007D.
- A tilde (~) as represented by the Unicode regular expression \u007E.
For additional guidelines, see Column names.
The expanded column characters are supported by both the BigQuery Storage Read API and the BigQuery Storage Write API. To use the expanded list of Unicode characters with the BigQuery Storage Read API, you must set a flag. You can use the displayName attribute to retrieve the column name. The following example shows how to set a flag with the Python client:
from google.cloud.bigquery_storage import types

requested_session = types.ReadSession()

# Set Avro serialization options for flexible columns.
options = types.AvroSerializationOptions()
options.enable_display_name_attribute = True
requested_session.read_options.avro_serialization_options = options

To use the expanded list of Unicode characters with the BigQuery Storage Write API, you must provide the schema with column_name notation, unless you are using the JsonStreamWriter writer object. The following example shows how to provide the schema:

syntax = "proto2";

package mypackage;

// Source protos located in github.com/googleapis/googleapis
import "google/cloud/bigquery/storage/v1/annotations.proto";

message FlexibleSchema {
  optional string item_name_column = 1
      [(.google.cloud.bigquery.storage.v1.column_name) = "name-列"];
  optional string item_description_column = 2
      [(.google.cloud.bigquery.storage.v1.column_name) = "description-列"];
}

In this example, item_name_column and item_description_column are placeholder names which need to be compliant with the protocol buffer naming convention. Note that column_name annotations always take precedence over placeholder names.
Limitations
Flexible column names are not supported with external tables.
You cannot load Parquet files containing columns that have a period (.) in the column name.
Column names from Parquet files are treated as case-insensitive when loaded into BigQuery. Identical case-insensitive names will cause collisions. To avoid this, either append an underscore to one of the duplicate column names or rename the columns before loading.
Debugging your Parquet file
If your load jobs fail with data errors, you can use PyArrow to verify whether your Parquet data files are corrupted. If PyArrow fails to read the files, the files are likely to be rejected by the BigQuery load job. The following example shows how to read the contents of a Parquet file by using PyArrow:

from pyarrow import parquet as pq

# Read the entire file
pq.read_table('your_sample_file.parquet')
# Read specific columns
pq.read_table('your_sample_file.parquet', columns=['some_column', 'another_column'])
# Read the metadata of specific columns
file_metadata = pq.read_metadata('your_sample_file.parquet')
for col in file_metadata.row_group(0).to_dict()['columns']:
    print(col['column_path_in_schema'])
    print(col['num_values'])

For more information, see the PyArrow docs.