Querying clustered tables
When you create a clustered table in BigQuery, the table data isautomatically organized based on the contents of one or more columns inthe table's schema. The columns you specify are used to colocate related data.When you cluster a table using multiple columns, the order of columns youspecify is important. The order of the specified columns determines the sortorder of the data.
To optimize performance when you run queries against clustered tables, use anexpression that filters on a clustered column or on multiple clustered columnsin the order the clustered columns are specified. Queries that filter onclustered columns generally perform better than queries that filter only onnon-clustered columns.
BigQuery sorts the data in a clustered table based on the valuesin the clustering columns and organizes them into blocks.
When you submit a query that contains a filter on a clustered column,BigQuery uses the clustering information to efficiently determinewhether a block contains any data relevant to the query. Thisallows BigQuery to only scan the relevant blocks — a processreferred to asblock pruning.
You can query clustered tables by:
- Using the Google Cloud console
- Using the bq command-line tool's
bq querycommand - Calling the
jobs.insertmethodand configuring aquery job - Using the client libraries
You can only useGoogleSQLwith clustered tables.
Go
Before trying this sample, follow theGo setup instructions in theBigQuery quickstart using client libraries. For more information, see theBigQueryGo API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, seeSet up authentication for client libraries.
import("context""fmt""io""cloud.google.com/go/bigquery""google.golang.org/api/iterator")// queryClusteredTable demonstrates querying a table that has a clustering specification.funcqueryClusteredTable(wio.Writer,projectID,datasetID,tableIDstring)error{// projectID := "my-project-id"// datasetID := "mydataset"// tableID := "mytable"ctx:=context.Background()client,err:=bigquery.NewClient(ctx,projectID)iferr!=nil{returnfmt.Errorf("bigquery.NewClient: %v",err)}deferclient.Close()q:=client.Query(fmt.Sprintf(`SELECT COUNT(1) as transactions, SUM(amount) as total_paid, COUNT(DISTINCT destination) as distinct_recipients FROM `+"`%s.%s`"+` WHERE timestamp > TIMESTAMP('2015-01-01')AND origin = @wallet`,datasetID,tableID))q.Parameters=[]bigquery.QueryParameter{{Name:"wallet",Value:"wallet00001866cb7e0f09a890",},}// Run the query and print results when the query job is completed.job,err:=q.Run(ctx)iferr!=nil{returnerr}status,err:=job.Wait(ctx)iferr!=nil{returnerr}iferr:=status.Err();err!=nil{returnerr}it,err:=job.Read(ctx)for{varrow[]bigquery.Valueerr:=it.Next(&row)iferr==iterator.Done{break}iferr!=nil{returnerr}fmt.Fprintln(w,row)}returnnil}Java
Before trying this sample, follow theJava setup instructions in theBigQuery quickstart using client libraries. For more information, see theBigQueryJava API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, seeSet up authentication for client libraries.
importcom.google.cloud.bigquery.BigQuery;importcom.google.cloud.bigquery.BigQueryException;importcom.google.cloud.bigquery.BigQueryOptions;importcom.google.cloud.bigquery.QueryJobConfiguration;importcom.google.cloud.bigquery.TableResult;publicclassQueryClusteredTable{publicstaticvoidrunQueryClusteredTable()throwsException{// TODO(developer): Replace these variables before running the sample.StringprojectId="MY_PROJECT_ID";StringdatasetName="MY_DATASET_NAME";StringtableName="MY_TABLE_NAME";queryClusteredTable(projectId,datasetName,tableName);}publicstaticvoidqueryClusteredTable(StringprojectId,StringdatasetName,StringtableName){try{// Initialize client that will be used to send requests. This client only needs to be created// once, and can be reused for multiple requests.BigQuerybigquery=BigQueryOptions.getDefaultInstance().getService();StringsourceTable="`"+projectId+"."+datasetName+"."+tableName+"`";Stringquery="SELECT word, word_count\n"+"FROM "+sourceTable+"\n"// Optimize query performance by filtering the clustered columns in sort order+"WHERE corpus = 'romeoandjuliet'\n"+"AND word_count >= 1";QueryJobConfigurationqueryConfig=QueryJobConfiguration.newBuilder(query).build();TableResultresults=bigquery.query(queryConfig);results.iterateAll().forEach(row->row.forEach(val->System.out.printf("%s,",val.toString())));System.out.println("Query clustered table performed successfully.");}catch(BigQueryException|InterruptedExceptione){System.out.println("Query not performed \n"+e.toString());}}}Python
Before trying this sample, follow thePython setup instructions in theBigQuery quickstart using client libraries. For more information, see theBigQueryPython API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, seeSet up authentication for client libraries.
fromgoogle.cloudimportbigquery# Construct a BigQuery client object.client=bigquery.Client()# TODO(developer): Set table_id to the ID of the destination table.# table_id = "your-project.your_dataset.your_table_name"sql="SELECT * FROM `bigquery-public-data.samples.shakespeare`"cluster_fields=["corpus"]job_config=bigquery.QueryJobConfig(clustering_fields=cluster_fields,destination=table_id)# Start the query, passing in the extra configuration.client.query_and_wait(sql,job_config=job_config)# Make an API request and wait for job to complete.table=client.get_table(table_id)# Make an API request.iftable.clustering_fields==cluster_fields:print("The destination table is written using the cluster_fields configuration.")Required permissions
To run a queryjob, you need thebigquery.jobs.create Identity and Access Management (IAM) permission on the project thatruns the query job.
Each of the following predefined IAM roles includes thepermissions that you need to run a query job:
roles/bigquery.adminroles/bigquery.jobUserroles/bigquery.user
You also need thebigquery.tables.getData permissionon all tables and views that your query references. In addition, when queryinga view you need this permission on all underlying tables and views.However, if you are usingauthorized viewsorauthorized datasets, you don't needaccess to the underlying source data.
Each of the following predefined IAM roles includes thepermission that you need on all tables and views that the query references:
roles/bigquery.adminroles/bigquery.dataOwnerroles/bigquery.dataEditorroles/bigquery.dataViewer
For more information about IAM roles in BigQuery,seePredefined roles and permissions.
Best practices
To get the best performance from queries against clustered tables, use thefollowing best practices.
For context, the sample table used in the best practice examples is aclustered table that is created by using a DDL statement. The DDL statementcreates a table namedClusteredSalesData. The table is clustered by thefollowing columns:customer_id,product_id,order_id, in that sort order.
CREATETABLE`mydataset.ClusteredSalesData`PARTITIONBYDATE(timestamp)CLUSTERBYcustomer_id,product_id,order_idASSELECT*FROM`mydataset.SalesData`
Filter clustered columns by sort order
When you specify a filter, use expressions that filter on the clustered columnsin sort order. Sort order is the column order given in theCLUSTER BY clause.To get the benefits of clustering, include one or more of the clusteredcolumns in left-to-right sort order, starting with the first column. In mostcases, the first clustering column is the most effective in blockpruning, then the second column, then the third. You can still use the secondor third column alone in the query, but block pruning probablywon't be as effective. The ordering of the column names inside the filterexpression doesn't affect performance.
The following example queries theClusteredSalesData clustered tablethat was created in the preceding example. The query includes a filterexpression that filters oncustomer_id and then onproduct_id. This queryoptimizes performance by filtering the clustered columns insortorder—the column order given in theCLUSTER BY clause.
SELECTSUM(totalSale)FROM`mydataset.ClusteredSalesData`WHEREcustomer_id=10000ANDproduct_idLIKE'gcp_analytics%'
The following query does not filter the clustered columns in sort order. As aresult, the performance of the query is not optimal. This query filters onproduct_id then onorder_id (skippingcustomer_id).
SELECTSUM(totalSale)FROM`mydataset.ClusteredSalesData`WHEREproduct_idLIKE'gcp_analytics%'ANDorder_id=20000
Don't use clustered columns in complex filter expressions
If you use a clustered column in a complex filter expression, the performance ofthe query is not optimized because block pruning cannot be applied.
For example, the following query won't prune blocks because a clusteredcolumn—customer_id—is used in a function in the filterexpression.
SELECTSUM(totalSale)FROM`mydataset.ClusteredSalesData`WHERECAST(customer_idASSTRING)="10000"
To optimize query performance by pruning blocks, use simple filter expressionslike the following. In this example, a simple filter is applied to theclustered column—customer_id.
SELECTSUM(totalSale)FROM`mydataset.ClusteredSalesData`WHEREcustomer_id=10000
Don't compare clustered columns to other columns
If a filter expression compares a clustered column to another column (either aclustered column or a non-clustered column), the performance of the query is notoptimized because block pruning cannot be applied.
The following query does not prune blocks because the filter expression comparesa clustered column—customer_id to another column—order_id.
SELECTSUM(totalSale)FROM`mydataset.ClusteredSalesData`WHEREcustomer_id=order_id
Table security
To control access to tables in BigQuery, seeControl access to resources with IAM.
What's next
- For more information on running queries, seeRunning interactive and batch queries.
- To learn how to create and use clustered tables, seeCreating and using clustered tables.
- For an overview of partitioned table support in BigQuery, seeIntroduction to partitioned tables.
- To learn how to create partitioned tables, seeCreating partitioned tables.
Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-12-15 UTC.