Optimizing Schema Design for Spanner

Google storage technologies power some of the world's largest applications.However, scale is not always an automatic result of using these systems.Designers must think carefully about how to model their data to ensure thattheir application can scale and perform as it grows in various dimensions.

Spanner is adistributed database, and using it effectively requiresthinking differently about schema design and access patterns than you might withtraditional databases. Distributed systems, by their nature, force designers tothink about data and processing locality.

Spanner supports SQL queries and transactions with the ability toscale out horizontally. Careful design is often necessary to realizeSpanner's full benefit. This paper discusses some of the keyideas that will help you to ensure that your application can scale to arbitrarylevels, and to maximize its performance. Two tools in particular have a greatimpact on scalability: key definition and interleaving.

Table layout

Rows in a Spanner table are organized lexicographically byPRIMARY KEY. Conceptually, keys are ordered by the concatenation of thecolumns in the order that they are declared in thePRIMARY KEY clause. Thisexhibits all the standard properties of locality:

Scanning the table in lexicographic order is efficient.
Sufficiently close rows will be stored in the same disk blocks, and will beread and cached together.

Spanner replicates your data across multiplezones for availability and scale. Each zone holdsa complete replica of your data. When you provision a Spannerinstance node, you specify itscomputecapacity. The compute capacity is the amount ofcompute resource allocated to your instance in each of these zones. While eachreplica is a complete set of your data, data within a replica is partitionedacross the compute resources in that zone.

Data within each Spanner replica is organized into two levels ofphysical hierarchy:database splits, thenblocks. Splits hold contiguousranges of rows, and are the unit by which Spanner distributesyour database across compute resources. Over time, splits may be broken intosmaller parts, merged, or moved to other nodes in your instance to increaseparallelism and allow your application to scale. Operations that span splits aremore expensive than equivalent operations that don't, due to increasedcommunication. This is true even if those splits happen to be served by the samenode.

There are two types of tables in Spanner:root tables(sometimes called top-level tables), andinterleaved tables. Interleavedtables are defined by specifying another table as itsparent, causing rows inthe interleaved table to be clustered with the parent row. Root tables have noparent, and each row in a root table defines a new top-level row, orroot row.Rows interleaved with this root row are calledchild rows, and the collectionof a root row plus all its descendants is called arow tree. The parent rowmust exist before you can insert child rows. The parent row can either alreadyexist in the database or can be inserted before the insertion of the child rowsin the same transaction.

Spanner automatically partitions splits when it deems necessarydue to size or load. To preserve data locality, Spanner prefersadding split boundaries as close as to theroot tables, so that any given rowtree can be kept in a single split. This means that operations within a row treetend to be more efficient because they are unlikely to require communicationwith other splits.

However, if there is hotspot in a child row, Spanner will attemptto add split boundaries tointerleaved tables in order to isolate that hotspotrow, along with all child rows below it.

Choosing which tables should be roots is an important decision in designing yourapplication to scale. Roots are typically things like Users, Accounts, Projectsand the like, and their child tables hold most of the other data about theentity in question.

Recommendations:

Use a common key prefix for related rows in the same table to improvelocality.
Interleave related data into another table whenever it makes sense.

Tradeoffs of locality

If data is frequently written or read together, it can benefit both latency andthroughput to cluster them by carefully selecting primary keys and usinginterleaving. This is because there is a fixed cost to communicating to anyserver or disk block, so why not get as much as possible while there?Furthermore, the more servers that you communicate with, the higher the chancethat you are going to encounter a temporarily busy server, increasing taillatencies. Finally, transactions that span splits, while automatic andtransparent in Spanner, have a slightly higher CPU cost andlatency due to the distributed nature of two-phase commit.

On the flip side, if data is related but not frequently accessed together,consider going out of your way to separate them. This has the most benefit whenthe infrequently accessed data is large. For example, many databases store largebinary data out-of-band from the primary row data, with only references to thelarge data interleaved.

Note that some level of two-phase commit and non-local data operations areunavoidable in a distributed database. Don't become overly concerned withgetting a perfect locality story for every operation. Focus on getting thedesired locality for the most important root entities and most common accesspatterns, and let less frequent or less performance sensitive distributedoperations happen when they need to. Two-phase commit and distributed reads arethere to help simplify schemas and ease programmer toil: in all but the mostperformance-critical use cases, it is better to let them.

Recommendations:

Organize your data into hierarchies such that data read or written togethertends to be nearby.
Consider storing large columns in non-interleaved tables if less frequentlyaccessed.

Index options

Secondary indexes allow you to quickly find rows by valuesother than the primary key. Spanner supports both non-interleavedand interleaved indexes. Non-interleaved indexes are the default and the typemost analogous to what is supported in a traditionalRDBMS. They do not place any restrictions over thecolumns being indexed and, while powerful, they are not always the best choice.Interleaved indexes must be defined over columns that share a prefix with theparent table, and allow greater control of locality.

Spanner stores index data in the same way as tables, with one rowper index entry. Many of the design considerations for tables also apply toindexes. Non-interleaved indexes store data in root tables. Because root tablescan be split between any root row, this ensures that non-interleaved indexes canscale to arbitrary size and, ignoring hot spots, to almost any workload.Unfortunately it also means that the index entries are usually not in the samesplits as the primary data. This creates extra work and latency for any writingprocess, and adds additional splits to consult at read time.

Interleaved indexes, by contrast, store data in interleaved tables. They aresuitable when you are searching within the domain of a single entity.Interleaved indexes force data and index entries to remain in the same row tree,making joins between them far more efficient. Examples of uses for aninterleaved index:

Accessing your photos by various sort orders like taken date, last modifieddate, title, album, etc.
Finding all your posts that have a particular set of tags.
Finding my previous shopping orders that contained a specific item.

Recommendations:

Use non-interleaved indexes when you need to find rows from anywhere in yourdatabase.
Prefer interleaved indexes whenever your searches are scoped to a singleentity.

STORING index clause

Secondary indexes allow you to find rows by attributes other than the primarykey. If all the data requested is in the index itself, it can be consulted onits own without reading the primary record. This can save significant resourcesas no join is required.

Unfortunately, index keys are limited to 16 in number and 8 KiB in aggregatesize, restricting what can be put in them. To compensate for these limitations,Spanner has the ability to store extra data in any index, usingtheSTORING clause.STORING a column in an index results in its values beingduplicated, with a copy stored in the index. You can think of an index withSTORING as a simple single table materialized view (views are not nativelysupported in Spanner at this time).

Another useful application ofSTORING is as part of aNULL_FILTERED index.This allows you to define what is effectively a materialized view of a sparsesubset of a table that you can scan efficiently. For example, you might createsuch an index on theis_unread column of a mailbox to be able to serve theunread messages view in a single table scan, but without paying for a completecopy of every mailbox.

Recommendations:

Make prudent use ofSTORING to tradeoff read time performance againststorage size and write time performance.
UseNULL_FILTERED to control storage costs of sparse indexes.

Anti-patterns

Anti-pattern: timestamp ordering

Many schema designers are inclined to define a root table that is timestampordered, and updated on every write. Unfortunately, this is one of the leastscalable things that you can do. The reason is that this design results in ahugehot spot at the end of the table that can't easily be mitigated. Aswrite rates increase, so do RPCs to a single split, as do lock contention eventsand other problems. Often these sorts of problems don't appear in small loadtests, and instead appear after the application has been in production for sometime. By then, it's too late!

If your application absolutely must include a log that is timestamp ordered,consider if you can make the log local by interleaving it in one of your otherroot tables. This has the benefit of distributing the hot spot over many roots.But you still need to be careful that each distinct root has sufficiently lowwrite rate.

If you need a global (cross root) timestamp ordered table, and you need tosupport higher write rates to that table than a single node is capable of, useapplication-levelsharding. Sharding a table means partitioning it into somenumber N of roughly equal divisions called shards. This is typically done byprefixing the original primary key with an additionalShardId column holdinginteger values between[0, N). TheShardId for a given write is typicallyselected either at random, or by hashing a part of the base key. Hashing isoften preferred because it can be used to ensure all records of a given type gointo the same shard, improving performance of retrieval. Either way, the goal isto ensure that, over time, writes are distributed across all shards equally.This approach sometimes means that reads need to scan all shards to reconstructthe original total ordering of writes.

Recommendations:

Avoid high write-rate timestamp ordered tables and indexesat all cost.
Use some technique to spread hot spots, be it interleaving in another tableor sharding.

Anti-pattern: sequences

Application developers love using database sequences (or auto-increment) togenerate primary keys. Unfortunately, this habit from the RDBMS days (calledsurrogate keys) is almost as harmful as the timestamp ordering anti-patterndescribed above. The reason is that database sequences tend to emit values in aquasi-monotonic way, over time, to producing values that are clustered near eachother. This typically produces hot spots when used as primary keys, especiallyfor root rows.

Contrary to RDBMS conventional wisdom, we recommend that you use real-worldattributes for primary keys whenever it makes sense. This is particularly thecase if the attribute is never going to change.

If you want to generate numerical unique primary keys, aim to get the high orderbits of subsequent numbers to be distributed roughly equally over the entirenumber space. One trick is to generate sequential numbers by conventional means,and then bit-reversing to obtain a final value. Alternatively you could lookinto aUUID generator, but becareful: not all UUID functions are created equally, and some store thetimestamp in the high order bits, effectively defeating the benefit. Make sureyour UUID generator pseudo-randomly chooses high order bits.

Recommendations:

Avoid using incrementing sequence values as primary keys. Instead,bit-reverse a sequence value, or use a carefully chosen UUID.
Use real-world values for primary keys rather than surrogate keys.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.

Movatterモバイル変換

Optimizing Schema Design for Spanner Stay organized with collections Save and categorize content based on your preferences.

Table layout

Recommendations:

Tradeoffs of locality

Recommendations:

Index options

Recommendations:

STORING index clause

Recommendations:

Anti-patterns

Anti-pattern: timestamp ordering

Recommendations:

Anti-pattern: sequences

Recommendations:

Optimizing Schema Design for Spanner