Optimize query computation
This document provides best practices for optimizing your query performance.
When you run a query, you can view the query plan in the Google Cloud console. You can also request execution details by using the INFORMATION_SCHEMA.JOBS* views or the jobs.get REST API method.
The query plan includes details about query stages and steps. These details can help you identify ways to improve query performance. For example, if you notice a stage that writes a lot more output than other stages, it might mean that you need to filter earlier in the query.
To learn more about the query plan and see examples of how the query plan information can help you to improve query performance, see Get query performance insights.

After addressing the query performance insights, you can further optimize your query by performing the following tasks:
- Reduce the data that is processed
- Optimize query operations
- Reduce the output of your query
- Avoid anti-SQL patterns
Reduce data processed
You can reduce the data that needs to be processed by using the options described in the following sections.
Avoid SELECT *
Best practice: Control projection by querying only the columns that you need.
Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results).
- Use the data preview options. If you are experimenting with data or exploring data, use one of the data preview options instead of SELECT *.
- Query specific columns. Applying a LIMIT clause to a SELECT * query does not affect the amount of data read. You are billed for reading all bytes in the entire table, and the query counts against your free tier quota. Instead, query only the columns you need. For example, use SELECT * EXCEPT to exclude one or more columns from the results.
- Use partitioned tables. If you do require queries against every column in a table, but only against a subset of data, consider:
  - Materializing results in a destination table and querying that table instead.
  - Partitioning your tables and querying the relevant partition. For example, use WHERE _PARTITIONDATE = "2017-01-01" to query only the January 1, 2017 partition.
- Use SELECT * EXCEPT. Querying a subset of data or using SELECT * EXCEPT can greatly reduce the amount of data that is read by a query. In addition to the cost savings, performance is improved by reducing the amount of data I/O and the amount of materialization that is required for the query results.

SELECT * EXCEPT (col1, col2, col5)
FROM mydataset.newtable
Avoid excessive wildcard tables
Best practice: When querying wildcard tables, use the most granular prefix.
Use wildcards to query multiple tables by using concise SQL statements. Wildcard tables are a union of tables that match the wildcard expression. Wildcard tables are useful if your dataset contains the following resources:
- Multiple, similarly named tables with compatible schemas
- Sharded tables
When you query a wildcard table, specify a wildcard (*) after the common table prefix. For example, FROM `bigquery-public-data.noaa_gsod.gsod194*` queries all tables from the 1940s.
More granular prefixes perform better than shorter prefixes. For example, FROM `bigquery-public-data.noaa_gsod.gsod194*` performs better than FROM `bigquery-public-data.noaa_gsod.*` because fewer tables match the wildcard.
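As a minimal sketch, the following query combines a granular decade prefix with a _TABLE_SUFFIX filter to narrow the match further; it assumes the stn and max columns that the public GSOD tables expose:

SELECT
  stn,
  max
FROM
  `bigquery-public-data.noaa_gsod.gsod194*`
WHERE
  -- Matches only gsod1940 through gsod1944.
  _TABLE_SUFFIX BETWEEN '0' AND '4';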
Avoid tables sharded by date
Best practice: Don't use tables sharded by date (also called date-named tables) in place of time-partitioned tables.
Partitioned tables perform better than date-named tables. When you create tables sharded by date, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance.
Avoid oversharding tables
Best practice: Avoid creating too many table shards. If you are sharding tables by date, use time-partitioned tables instead.

Table sharding refers to dividing large datasets into separate tables and adding a suffix to each table name. If you are sharding tables by date, use time-partitioned tables instead.
Because of the low cost of BigQuery storage, you don't need to optimize your tables for cost as you would in a relational database system. Creating a large number of table shards has performance impacts that outweigh any cost benefits.

Sharded tables require BigQuery to maintain schema, metadata, and permissions for each shard. Because of the added overhead required to maintain information on each shard, oversharding tables can impact query performance.
The amount and source of data read by a query can impact query performance and cost.
Prune partitioned queries
Best practice: When querying a partitioned table, use the following columns to filter the partitions:
- For ingestion-time partitioned tables, use the _PARTITIONTIME pseudocolumn.
- For time-unit column-based and integer-range partitioned tables, use the partitioning column.
For time-unit partitioned tables, filtering the data with _PARTITIONTIME or the partitioning column lets you specify a date or range of dates. For example, the following WHERE clause uses the _PARTITIONTIME pseudocolumn to specify partitions between January 1, 2016 and January 31, 2016:
WHERE _PARTITIONTIME
BETWEEN TIMESTAMP("2016-01-01")
AND TIMESTAMP("2016-01-31")

The query processes data only in the partitions that are indicated by the date range. Filtering your partitions improves query performance and reduces costs.
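The same pruning applies to column-partitioned tables. A minimal sketch, assuming a hypothetical mydataset.sales table partitioned on a DATE column named transaction_date:

SELECT
  customer_id,
  amount
FROM
  mydataset.sales
WHERE
  -- Filtering on the partitioning column prunes the scan to one month of partitions.
  transaction_date BETWEEN '2016-01-01' AND '2016-01-31';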
Reduce data before using a JOIN
Best practice: Reduce the amount of data that is processed before a JOIN clause by performing aggregations.
Using a GROUP BY clause with aggregate functions is computationally intensive, because these types of queries use shuffle. Because of this cost, use a GROUP BY clause only when necessary.
For queries with GROUP BY and JOIN, perform aggregation earlier in the query to reduce the amount of data processed. For example, the following query performs a JOIN on two large tables without any filtering beforehand:
WITH
  users_posts AS (
    SELECT *
    FROM
      `bigquery-public-data`.stackoverflow.comments AS c
    JOIN
      `bigquery-public-data`.stackoverflow.users AS u
    ON c.user_id = u.id
  )
SELECT
  user_id,
  ANY_VALUE(display_name) AS display_name,
  ANY_VALUE(reputation) AS reputation,
  COUNT(text) AS comments_count
FROM users_posts
GROUP BY user_id
ORDER BY comments_count DESC
LIMIT 20;
This query pre-aggregates the comment counts, which reduces the amount of data read for the JOIN:
WITH
  comments AS (
    SELECT
      user_id,
      COUNT(text) AS comments_count
    FROM
      `bigquery-public-data`.stackoverflow.comments
    WHERE user_id IS NOT NULL
    GROUP BY user_id
    ORDER BY comments_count DESC
    LIMIT 20
  )
SELECT
  user_id,
  display_name,
  reputation,
  comments_count
FROM comments
JOIN
  `bigquery-public-data`.stackoverflow.users AS u
ON user_id = u.id
ORDER BY comments_count DESC;
Note: WITH clauses with common table expressions (CTEs) are used for query readability, not performance. There is no guarantee that adding a WITH clause causes BigQuery to materialize temporary intermediate tables and reuse the temporary result for multiple references. The WITH clause might be evaluated multiple times within a query, depending on query optimizer decisions.

Use the WHERE clause
Best practice: Use a WHERE clause to limit the amount of data a query returns. When possible, use BOOL, INT64, FLOAT64, or DATE columns in the WHERE clause.
Operations on BOOL, INT64, FLOAT64, and DATE columns are typically faster than operations on STRING or BYTE columns. When possible, use a column that uses one of these data types in the WHERE clause to reduce the amount of data returned by the query.
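A minimal sketch, assuming a hypothetical mydataset.events table with BOOL, INT64, and DATE columns:

SELECT
  event_id
FROM
  mydataset.events
WHERE
  is_active                       -- BOOL comparison: cheap
  AND status_code = 200           -- INT64 comparison: cheap
  AND event_date >= '2024-01-01'; -- DATE comparison: cheap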
Use materialized views
Best practice: Use materialized views to precompute the results of a query for increased performance and efficiency.
Materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency. BigQuery leverages precomputed results from materialized views and, whenever possible, reads only changes from the base tables to compute up-to-date results. Materialized views can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables.
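A minimal sketch of creating a materialized view, assuming a hypothetical mydataset.events base table:

-- Precompute a daily aggregation over the base table.
CREATE MATERIALIZED VIEW mydataset.daily_event_totals AS
SELECT
  event_date,
  COUNT(*) AS event_count
FROM
  mydataset.events
GROUP BY
  event_date;

Queries against mydataset.daily_event_totals, and eligible queries against the base table, can then be served from the precomputed results.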
Use BI Engine
Best practice: Use BigQuery BI Engine to accelerate queries by caching the data that you use most frequently.
Consider adding a BI Engine reservation to the project where the queries are being computed. BigQuery BI Engine uses a vectorized query engine to accelerate SELECT query performance.
Use search indexes
Best practice: Use search indexes for efficient row lookups when you need to find individual rows of data in large tables.
A search index is a data structure designed to enable very efficient search with the SEARCH function. A search index can also accelerate queries that use other operators and functions, such as the equal (=), IN, or LIKE operators and certain string and JSON functions.
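A minimal sketch, assuming a hypothetical mydataset.logs table:

-- Index all columns of the table for search.
CREATE SEARCH INDEX logs_index
ON mydataset.logs (ALL COLUMNS);

-- Later, look up individual rows efficiently.
SELECT *
FROM mydataset.logs
WHERE SEARCH(logs, 'corrupted-block-404');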
Optimize query operations
You can optimize your query operations by using the options described in the following sections.
Avoid repeatedly transforming data
Best practice: If you are using SQL to perform ETL operations, then avoid situations where you are repeatedly transforming the same data.
For example, if you are using SQL to trim strings or extract data by using regular expressions, it is more performant to materialize the transformed results in a destination table. Functions like regular expressions require additional computation. Querying the destination table without the added transformation overhead is much more efficient.
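A minimal sketch, assuming a hypothetical mydataset.raw_logs table with a raw_line STRING column:

-- Materialize the transformed results once...
CREATE TABLE mydataset.parsed_logs AS
SELECT
  REGEXP_EXTRACT(raw_line, r'^(\S+)') AS ip_address,
  TRIM(raw_line) AS line
FROM
  mydataset.raw_logs;

-- ...then query the destination table without repeating the regular expression work.
SELECT ip_address FROM mydataset.parsed_logs WHERE ip_address IS NOT NULL;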
Avoid multiple evaluations of the same CTEs
Best practice: Use procedural language, variables, temporary tables, and automatically expiring tables to persist calculations and use them later in the query.
When your query contains common table expressions (CTEs) that are used in multiple places in the query, they might end up being evaluated each time they are referenced. The query optimizer attempts to detect parts of the query that could be executed only once, but this might not always be possible. As a result, using a CTE might not help reduce internal query complexity and resource consumption.
You can store the result of a CTE in a scalar variable or a temporary table, depending on the data that the CTE returns.
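For example, a minimal multi-statement sketch that persists a result in a temporary table and reads it twice, assuming a hypothetical mydataset.orders table:

CREATE TEMP TABLE user_totals AS
SELECT
  user_id,
  SUM(amount) AS total_amount
FROM
  mydataset.orders
GROUP BY
  user_id;

-- Both statements reuse the persisted result instead of re-evaluating a CTE.
SELECT user_id FROM user_totals WHERE total_amount > 1000;
SELECT AVG(total_amount) AS avg_total FROM user_totals;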
Avoid repeated joins and subqueries
Best practice: Avoid repeatedly joining the same tables and using the same subqueries.
Instead of repeatedly joining the data, it might be more performant for you to use nested repeated data to represent the relationships. Nested repeated data saves you the performance impact of the communication bandwidth that a join requires. It also saves you the I/O costs that you incur by repeatedly reading and writing the same data. For more information, see Use nested and repeated fields.
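A minimal sketch of querying nested repeated data in place of a join, assuming a hypothetical mydataset.orders table with a repeated RECORD column named items:

SELECT
  o.order_id,
  item.sku,
  item.quantity
FROM
  mydataset.orders AS o,
  -- UNNEST reads the child rows in place; no second table is required.
  UNNEST(o.items) AS item;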
Similarly, repeating the same subqueries affects performance through repetitive query processing. If you are using the same subqueries in multiple queries, consider materializing the subquery results in a table. Then consume the materialized data in your queries.
Materializing your subquery results improves performance and reduces the overall amount of data that BigQuery reads and writes. The small cost of storing the materialized data is outweighed by the performance savings of avoiding repeated I/O and query processing.
Optimize your join patterns
Best practice: For queries that join data from multiple tables, optimize your join patterns by starting with the largest table.
When you create a query by using a JOIN clause, consider the order in which you are merging the data. The GoogleSQL query optimizer determines which table should be on which side of the join. As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size.
When you have a large table as the left side of the JOIN and a small one on the right side of the JOIN, a broadcast join is created. A broadcast join sends all the data in the smaller table to each slot that processes the larger table. It is advisable to perform the broadcast join first.
To view the size of the tables in your JOIN, see Get information about tables.
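A minimal sketch of this ordering, assuming hypothetical tables where large_table has the most rows and small_table the fewest:

SELECT
  l.id,
  s.label,
  m.value
FROM
  mydataset.large_table AS l     -- Largest table first.
JOIN
  mydataset.small_table AS s     -- Table with the fewest rows second.
  ON l.key = s.key
JOIN
  mydataset.medium_table AS m    -- Remaining tables by decreasing size.
  ON l.id = m.id;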
Specify primary key and foreign key constraints
Best practice: Specify key constraints in the table schema when table data satisfies the data integrity requirements of primary key or foreign key constraints. The query engine can use the key constraints to optimize query plans.
BigQuery doesn't automatically check for data integrity, so you must ensure that your data meets the constraints specified in the table schema. If you don't maintain data integrity in tables with specified constraints, your query results might be inaccurate.
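A minimal sketch of declaring unenforced key constraints, assuming hypothetical mydataset.customers and mydataset.orders tables whose data already satisfies them:

-- BigQuery constraints are metadata only, so they must be NOT ENFORCED.
ALTER TABLE mydataset.customers
  ADD PRIMARY KEY (customer_id) NOT ENFORCED;

ALTER TABLE mydataset.orders
  ADD CONSTRAINT fk_customer
    FOREIGN KEY (customer_id)
    REFERENCES mydataset.customers (customer_id) NOT ENFORCED;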
Optimize the ORDER BY clause
Best practice: When you use the ORDER BY clause, follow these best practices:
- Use ORDER BY in the outermost query or within window clauses. Push complex operations to the end of the query. Placing an ORDER BY clause in the middle of a query greatly impacts performance unless it is being used in a window function. Another technique for ordering your query is to push complex operations, such as regular expressions and mathematical functions, to the end of the query. This technique reduces the data to be processed before the complex operations are performed.
- Use a LIMIT clause. If you are ordering a very large number of values but don't need to have all of them returned, use a LIMIT clause. For example, the following query orders a very large result set and throws a Resources exceeded error. The query sorts by the title column in mytable. The title column contains millions of values.

SELECT title
FROM `my-project.mydataset.mytable`
ORDER BY title;
To remove the error, use a query like the following:
SELECT title
FROM `my-project.mydataset.mytable`
ORDER BY title DESC
LIMIT 1000;
- Use a window function. If you are ordering a very large number of values, use a window function, and limit data before calling the window function. For example, the following query lists the ten oldest Stack Overflow users and their ranking, with the oldest account being ranked lowest:
SELECT
  id,
  reputation,
  creation_date,
  DENSE_RANK() OVER (ORDER BY creation_date) AS user_rank
FROM `bigquery-public-data`.stackoverflow.users
ORDER BY user_rank ASC
LIMIT 10;
This query takes approximately 15 seconds to run. This query uses LIMIT at the end of the query, but not in the DENSE_RANK() OVER window function. Because of this, the query requires all of the data to be sorted on a single worker node. Instead, you should limit the dataset before computing the window function in order to improve performance:
WITH
  users AS (
    SELECT
      id,
      reputation,
      creation_date
    FROM `bigquery-public-data`.stackoverflow.users
    ORDER BY creation_date ASC
    LIMIT 10
  )
SELECT
  id,
  reputation,
  creation_date,
  DENSE_RANK() OVER (ORDER BY creation_date) AS user_rank
FROM users
ORDER BY user_rank;
This query takes approximately 2 seconds to run, while returning the same results as the previous query.
One caveat is that the DENSE_RANK() function ranks the data within years, so for ranking data that spans multiple years, these queries don't give identical results.
Split complex queries into smaller ones
Best practice: Leverage multi-statement query capabilities and stored procedures to break computations that were designed as one complex query into multiple smaller, simpler queries.
Complex queries, REGEX functions, and layered subqueries or joins can be slow and resource intensive to run. Trying to fit all computations in one huge SELECT statement, for example to make it a view, is sometimes an antipattern, and it can result in a slow, resource-intensive query. In extreme cases, the internal query plan becomes so complex that BigQuery is unable to execute it.
Splitting up a complex query allows for materializing intermediate results in variables or temporary tables. You can then use these intermediate results in other parts of the query. This is especially useful when the results are needed in more than one place in the query.
Often, this approach lets you better express the true intent of parts of the query, with temporary tables acting as the data materialization points.
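A minimal multi-statement sketch that splits one calculation across simpler steps with a variable, assuming a hypothetical mydataset.results table:

DECLARE avg_score FLOAT64;

-- Step 1: Compute the intermediate value once.
SET avg_score = (SELECT AVG(score) FROM mydataset.results);

-- Step 2: Reuse it in a simpler follow-up query.
SELECT id, score
FROM mydataset.results
WHERE score > avg_score;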
Use nested and repeated fields
For information about how to denormalize data storage using nested and repeated fields, see Use nested and repeated fields.
Use INT64 data types in joins
Best practice: Use INT64 data types in joins instead of STRING data types to reduce cost and improve comparison performance.
BigQuery doesn't index primary keys like traditional databases, so the wider the join column is, the longer the comparison takes. Therefore, using INT64 data types in joins is cheaper and more efficient than using STRING data types.
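A minimal sketch, assuming hypothetical tables that carry both a numeric ID and a wider string key:

SELECT
  o.order_id,
  c.name
FROM
  mydataset.orders AS o
JOIN
  mydataset.customers AS c
  -- Joining on a narrow INT64 key compares faster than joining on a wide
  -- STRING key such as an email address or a UUID stored as text.
  ON o.customer_id = c.customer_id;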
Reduce query outputs
You can reduce the query outputs by using the options described in the following sections.
Materialize large result sets
Best practice: Consider materializing large result sets to a destination table. Writing large result sets has performance and cost impacts.
BigQuery limits cached results to approximately 10 GB compressed. Queries that return larger results exceed this limit and frequently result in the following error: Response too large.
This error often occurs when you select a large number of fields from a table with a considerable amount of data. Issues writing cached results can also occur in ETL-style queries that normalize data without reduction or aggregation.
You can overcome the limitation on cached result size by using the following options:
- Use filters to limit the result set
- Use a LIMIT clause to reduce the result set, especially if you are using an ORDER BY clause
- Write the output data to a destination table
You can page through the results using the BigQuery REST API. For more information, see Paging through table data.
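A minimal sketch that writes a large result to a destination table that cleans itself up, assuming a hypothetical mydataset.big_table; the expiration option anticipates the guidance in the note that follows:

CREATE TABLE mydataset.large_results
OPTIONS (
  -- The table is deleted automatically after seven days.
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
) AS
SELECT *
FROM mydataset.big_table
WHERE event_date >= '2024-01-01';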
Note: Writing very large result sets to destination tables impacts query performance (I/O). In addition, you incur a small cost for storing the destination table. You can automatically delete a large destination table by using the dataset's default table expiration. For more information, see Use the expiration settings in the storage best practices.

Avoid anti-SQL patterns
The following best practices provide guidance on avoiding query anti-patterns that impact performance in BigQuery.
Avoid self joins
Best practice: Instead of using self-joins, use a window (analytic) function or the PIVOT operator.
Typically, self-joins are used to compute row-dependent relationships. The result of using a self-join is that it potentially squares the number of output rows. This increase in output data can cause poor performance.
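For example, instead of self-joining a table to compare each row with the previous one, a window function can compute the comparison directly. A minimal sketch, assuming a hypothetical mydataset.daily_metrics table:

SELECT
  metric_date,
  value,
  -- LAG() reads the previous row's value without joining the table to itself.
  value - LAG(value) OVER (ORDER BY metric_date) AS day_over_day_change
FROM
  mydataset.daily_metrics;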
Avoid cross joins
Best practice: Avoid joins that generate more outputs than inputs. When a CROSS JOIN is required, pre-aggregate your data.
Cross joins are queries where each row from the first table is joined to every row in the second table, with non-unique keys on both sides. The worst case output is the number of rows in the left table multiplied by the number of rows in the right table. In extreme cases, the query might not finish.
If the query job completes, the query plan explanation shows output rows versus input rows. You can confirm a Cartesian product by modifying the query to print the number of rows on each side of the JOIN clause, grouped by the join key. You can also check the performance insights in the query execution graph for a high cardinality join.
To avoid performance issues associated with joins that generate more outputs than inputs:
- Use a GROUP BY clause to pre-aggregate the data, as shown in the sketch after this list.
- Use a window function. Window functions are often more efficient than using a cross join. For more information, see window functions.
Avoid DML statements that update or insert single rows
Best practice: Avoid DML statements that update or insert single rows. Batch your updates and inserts.
Using point-specific DML statements is an attempt to treat BigQuery like an Online Transaction Processing (OLTP) system. BigQuery focuses on Online Analytical Processing (OLAP) by using table scans and not point lookups. If you need OLTP-like behavior (single-row updates or inserts), consider a database designed to support OLTP use cases such as Cloud SQL.
BigQuery DML statements are intended for bulk updates. UPDATE and DELETE DML statements in BigQuery are oriented towards periodic rewrites of your data, not single row mutations. The INSERT DML statement is intended to be used sparingly. Inserts consume the same modification quotas as load jobs. If your use case involves frequent single row inserts, consider streaming your data instead.
If batching your UPDATE statements yields many tuples in very long queries, you might approach the query length limit of 256 KB. To work around the query length limit, consider whether your updates can be handled based on logical criteria instead of a series of direct tuple replacements.
For example, you could load your set of replacement records into another table, then write the DML statement to update all values in the original table if the non-updated columns match. For instance, if the original data is in table t and the updates are staged in table u, the query would look like the following:
UPDATE dataset.t t
SET my_column = u.my_column
FROM dataset.u u
WHERE t.my_key = u.my_key
Use alias names for similarly named columns
Best practice: Use column and table aliases when you work with similarly named columns across queries, including subqueries.
Aliases help to identify which columns and tables are referenced in addition to your initial reference of the column. Using aliases can help you understand and address problems in your SQL query, including finding the columns that are used in subqueries.
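A minimal sketch, assuming hypothetical tables that share a user_id column name:

SELECT
  c.user_id,          -- The alias makes clear which table the column comes from.
  u.display_name
FROM
  mydataset.comments AS c
JOIN
  mydataset.users AS u
  ON c.user_id = u.id;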
What's next
- Learn how to optimize cost.
- Learn how to optimize storage.
- Learn how to optimize functions.