Use nested and repeated fields

BigQuery can be used with many different data modelling methods, andgenerally provides high performance across many data model methodologies. To furthertune a data model for performance, one method you might consider is data denormalization,which means adding columns of data to a single table to reduce or remove table joins.

Best practice: Use nested and repeated fields to denormalize datastorage and increase query performance.

Denormalization is a common strategy for increasing read performance forrelational datasets that were previously normalized. The recommended way todenormalize data in BigQuery is to use nested and repeatedfields. It's best to use this strategy when the relationships are hierarchicaland frequently queried together, such as in parent-child relationships.

The storage savings from using normalized data has less of an effect in modernsystems. Increases in storage costs are worth the performance gains of usingdenormalized data. Joins require data coordination (communicationbandwidth). Denormalization localizes the data to individualslots, so that execution can be done in parallel.

To maintain relationships while denormalizing your data, you can use nestedand repeated fields instead of completely flattening your data. When relationaldata is completely flattened, network communication (shuffling) can negativelyimpact query performance.

For example, denormalizing an orders schema without using nested and repeatedfields might require you to group the data by a field likeorder_id(when there is a one-to-many relationship). Because of the shuffling involved,grouping the data is less effective than denormalizing the data by usingnested and repeated fields.

In some circumstances, denormalizing your data and using nested and repeatedfields doesn't result in increased performance. For example, star schemas aretypically optimized schemas for analytics, and as a result, performance mightnot be significantly different if you attempt to denormalize further.

Using nested and repeated fields

BigQuery doesn't require a completely flat denormalization. Youcan use nested and repeated fields to maintain relationships.

  • Nesting data (STRUCT)

    • Nesting data lets you represent foreign entities inline.
    • Querying nested data uses "dot" syntax to reference leaf fields, which issimilar to the syntax using a join.
    • Nested data is represented as aSTRUCT typein GoogleSQL.
  • Repeated data (ARRAY)

    • Creating a field of typeRECORD with the mode set toREPEATED letsyou preserve a one-to-many relationship inline (so long as the relationshipisn't high cardinality).
    • With repeated data, shuffling is not necessary.
    • Repeated data is represented as anARRAY. You can use anARRAY functionin GoogleSQL when you query the repeated data.
  • Nested and repeated data (ARRAY ofSTRUCTs)

    • Nesting and repetition complement each other.
    • For example, in a table of transaction records, you could include an arrayof line itemSTRUCTs.

For more information, seeSpecify nested and repeated columns in table schemas.

For more information about denormalizing data, seeDenormalization.

Example

Consider anOrders table with a row for each line item sold:

Order_IdItem_Name
001A1
001B1
002A1
002C1

If you wanted to analyze data from this table, you would need to use aGROUP BY clause, similar to the following:

SELECTCOUNT(Item_Name)FROMOrdersGROUPBYOrder_Id;

TheGROUP BY clause involves additional computation overhead, but this can beavoided by nesting repeated data. You can avoid using aGROUP BY clauseby creating a table with one order per row, where the order line items are in anested field:

Order_Id Item_Name
001A1

B1
002A1

C1

In BigQuery, you typically specify a nested schema as anARRAYofSTRUCT objects. You use theUNNEST operatortoflatten the nested data,as shown in the following query:

SELECT*FROMUNNEST([STRUCT('001'ASOrder_Id,['A1','B1']ASItem_Name),STRUCT('002'ASOrder_Id,['A1','C1']ASItem_Name)]);

This query yields results similar to the following:

Query output with unnested data

If this data wasn't nested, you could potentially have several rows foreach order, one for each item sold in that order, which would result in alarge table and an expensiveGROUP BY operation.

Exercise

You can see the performance difference in queries that use nested fields ascompared to those that don't by following the steps in this section.

  1. Create a table based on thebigquery-public-data.stackoverflow.commentspublic dataset:

    CREATEORREPLACETABLE`PROJECT.DATASET.stackoverflow`AS(SELECTuser_id,post_id,creation_dateFROM`bigquery-public-data.stackoverflow.comments`);
  2. Using thestackoverflow table, run the following query to see theearliest comment for each user:

    SELECTuser_id,ARRAY_AGG(STRUCT(post_id,creation_dateASearliest_comment)ORDERBYcreation_dateASCLIMIT1)[OFFSET(0)].*FROM`PROJECT.DATASET.stackoverflow`GROUPBYuser_idORDERBYuser_idASC;

    This query takes about 25 seconds to run and processes 1.88 GB of data.

  3. Create a second table with identical data that creates acomments fieldusing aSTRUCT type to store thepost_id andcreation_date data, insteadof two individual fields:

    CREATEORREPLACETABLE`PROJECT.DATASET.stackoverflow_nested`AS(SELECTuser_id,ARRAY_AGG(STRUCT(post_id,creation_date)ORDERBYcreation_dateASC)AScommentsFROM`bigquery-public-data.stackoverflow.comments`GROUPBYuser_id);
  4. Using thestackoverflow_nested table, run the following queryto see the earliest comment for each user:

    SELECTuser_id,(SELECTASSTRUCTpost_id,creation_dateasearliest_commentFROMUNNEST(comments)ORDERBYcreation_dateASCLIMIT1).*FROM`PROJECT.DATASET.stackoverflow_nested`ORDERBYuser_idASC;

    This query takes about 10 seconds to run and processes 1.28 GB of data.

  5. Delete thestackoverflowandstackoverflow_nested tables when you are finished with them.

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.