Bulk loading best practices
This page provides guidelines for efficiently bulk loading large amounts of data into Spanner.
You have several options for bulk loading data into Spanner:
- Insert rows using Data Manipulation Language (DML).
- Insert rows using mutations.
- Import data using the Dataflow connector.
- Import a database using Avro files.
- Import data in CSV format.
While you can also insert rows using the Google Cloud CLI, we don't recommend that you use the gcloud CLI for bulk loading.
Performance guidelines for bulk loading
To achieve optimal bulk loading performance, maximize your use of partitioning to distribute writing the data across worker tasks.
Spanner uses load-based splitting to evenly distribute your data load across the instance's compute resources. After a few minutes of high load, Spanner introduces split boundaries between rows. In general, if your data load is well-distributed and you follow best practices for schema design and bulk loading, your write throughput should double every few minutes until you saturate the available CPU resources in your instance.
Partition your data by primary key
Spanner automatically partitions tables into smaller ranges. The primary key for a row determines where it is partitioned.
To get optimal write throughput for bulk loads, partition your data by primary key with this pattern:
- Each partition contains a range of consecutive rows, as determined by the key columns.
- Each commit contains data for only a single partition.
We recommend that the number of partitions be 10 times the number of nodes in your Spanner instance. To assign rows to partitions:
- Sort your data by primary key.
- Divide the data into 10 * (number of nodes) separate, equally sized partitions.
- Create and assign a separate worker task to each partition. Creating the worker tasks happens in your application. It is not a Spanner feature.
Following this pattern, you should see a maximum overall bulk write throughput of 10-20 MB per second per node for large loads.
As you load data, Spanner creates and updates splits to balance the load on the nodes in your instance. During this process, you may experience temporary drops in throughput.
Example
You have a regional configuration with 3 nodes. You have 90,000 rows in a non-interleaved table. The primary keys in the table range from 1 to 90000.
- Rows: 90,000 rows
- Nodes: 3
- Partitions: 10 * 3 = 30
- Rows per partition: 90000 / 30 = 3000
The first partition includes the key range 1 to 3000. The second partition includes the key range 3001 to 6000. The 30th partition includes the key range 87001 to 90000. (You shouldn't use sequential keys in a large table. This example is only for demonstration.)
Each worker task sends the writes for a single partition. Within each partition, you should write the rows sequentially by primary key. Writing rows randomly, with respect to the primary key, should also provide reasonably high throughput. Measuring test runs will give you insight into which approach provides the best performance for your dataset.
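For illustration, here is a minimal sketch of this pattern using the Python client library (google-cloud-spanner). The instance ID, database ID, "items" table, and generated values are hypothetical; a real load would read pre-sorted source data instead.

```python
# A minimal sketch, assuming the Python client library (google-cloud-spanner).
# The instance ID, database ID, "items" table, and generated values are
# hypothetical; a real load would read pre-sorted source data instead.
from concurrent.futures import ThreadPoolExecutor

from google.cloud import spanner

NODES = 3
PARTITIONS = 10 * NODES                         # 30, per the guideline above
TOTAL_ROWS = 90_000
ROWS_PER_PARTITION = TOTAL_ROWS // PARTITIONS   # 3,000 rows per partition
ROWS_PER_COMMIT = 500                           # a few hundred rows per commit

client = spanner.Client()
database = client.instance("test-instance").database("example-db")


def write_partition(first_key: int, last_key: int) -> None:
    """Writes one partition's key range sequentially, by primary key,
    in commits of a few hundred rows each."""
    for start in range(first_key, last_key + 1, ROWS_PER_COMMIT):
        end = min(start + ROWS_PER_COMMIT - 1, last_key)
        with database.batch() as batch:  # one commit per chunk
            batch.insert(
                table="items",           # hypothetical table
                columns=("id", "data"),
                values=[(k, f"row-{k}") for k in range(start, end + 1)],
            )


# One worker task per partition; each partition is a consecutive key range.
with ThreadPoolExecutor(max_workers=PARTITIONS) as executor:
    futures = [
        executor.submit(
            write_partition,
            p * ROWS_PER_PARTITION + 1,
            (p + 1) * ROWS_PER_PARTITION,
        )
        for p in range(PARTITIONS)
    ]
    for future in futures:
        future.result()  # surface any write errors
```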
Bulk load without partitioning
Writing a contiguous set of rows in a commit can be faster than writing random rows. Random rows also likely include data from different partitions.
When a commit writes to more partitions, more coordination across servers is required, which raises the commit latency and overhead.
Multiple partitions are likely involved because each random row could belong to a different partition. In the worst-case scenario, each write involves every partition in your Spanner instance. As mentioned previously, write throughput is lowered when more partitions are involved.
Avoid overload
It's possible to send more write requests than Spanner can handle. Spanner handles the overload by aborting transactions, which is called pushback. For write-only transactions, Spanner automatically retries the transaction. In those cases, the pushback shows up as high latency. During heavy loads, pushback can last for up to a minute. During severely heavy loads, pushback can last for several minutes. To avoid pushback, throttle write requests to keep CPU utilization within reasonable limits. Alternatively, you can increase the number of nodes so that CPU utilization stays within the limits.
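Spacing commits out on the client is one way to apply such throttling. The following Python sketch is one possible client-side approach, not a Spanner feature; MAX_COMMITS_PER_SECOND is an assumed figure that you would tune against the CPU utilization you observe.

```python
# A minimal client-side throttling sketch. MAX_COMMITS_PER_SECOND is an
# assumed figure to tune against observed CPU utilization.
import threading
import time

MAX_COMMITS_PER_SECOND = 100

_lock = threading.Lock()
_next_slot = time.monotonic()


def throttle() -> None:
    """Blocks the calling worker until it may issue its next commit."""
    global _next_slot
    with _lock:
        now = time.monotonic()
        wait = max(0.0, _next_slot - now)
        _next_slot = max(now, _next_slot) + 1.0 / MAX_COMMITS_PER_SECOND
    if wait:
        time.sleep(wait)

# Each worker calls throttle() immediately before opening a new commit.
```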
Commit between 1 MB and 5 MB of mutations at a time
Each write to Spanner contains some overhead, whether the write is big or small. To maximize throughput, maximize the amount of data stored per write. Larger writes lower the ratio of overhead per write. A good technique is for each commit to mutate hundreds of rows. When writing relatively large rows, a commit size of 1 MB to 5 MB usually provides the best performance. When writing small values, or values that are indexed, it is generally best to write at most a few hundred rows in a single commit. Independently of the commit size and number of rows, be aware that there is a limit of 80,000 mutations per commit. To determine optimal performance, you should test and measure the throughput.
Commits larger than 5 MB or more than a few hundred rows don't provide extra benefit, and they risk exceeding the Spanner limits on commit size and mutations per commit.
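As a back-of-the-envelope aid, the following Python sketch estimates a rows-per-commit chunk size from assumed figures (column count, index count, and average row size). The per-column mutation counting is an approximation, so check the current Spanner limits documentation for the exact rules before relying on it.

```python
# A back-of-the-envelope sizing sketch. The column count, index count, and
# average row size are assumed figures; the per-column mutation counting is
# an approximation of how Spanner counts mutations.
COLUMNS_PER_ROW = 8
INDEXES_ON_TABLE = 1
AVG_ROW_BYTES = 4_096          # measure this from your actual data

MUTATION_LIMIT = 80_000        # per-commit mutation limit noted above
TARGET_COMMIT_BYTES = 5 * 1024 * 1024

# Each inserted row contributes roughly one mutation per written column,
# plus one entry per secondary index on the table.
mutations_per_row = COLUMNS_PER_ROW + INDEXES_ON_TABLE

max_rows_by_mutations = MUTATION_LIMIT // mutations_per_row
max_rows_by_bytes = TARGET_COMMIT_BYTES // AVG_ROW_BYTES

rows_per_commit = min(max_rows_by_mutations, max_rows_by_bytes, 500)
print(rows_per_commit)  # 500 here; a few hundred rows for typical row sizes
```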
Guidelines for secondary indexes
If your database has secondary indexes, you must choose whether to add the indexes to the database schema before or after you load the table data.
Adding the index before data is loaded allows the schema change to complete immediately. However, each write affecting the index takes longer, since it also needs to update the index. When the data load is complete, the database is immediately usable with all indexes in place. To create a table and its indexes at the same time, send the DDL statements for the new table and the new indexes in a single request to Spanner.
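For example, here is a minimal sketch of that single request using the Python client library; the table and index names are hypothetical.

```python
# A minimal sketch, assuming the Python client library; the table and index
# names are hypothetical. Both DDL statements go in a single update_ddl()
# request, so the table and its index are created together.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("test-instance").database("example-db")

operation = database.update_ddl([
    """CREATE TABLE Singers (
         SingerId  INT64 NOT NULL,
         FirstName STRING(1024),
         LastName  STRING(1024)
       ) PRIMARY KEY (SingerId)""",
    "CREATE INDEX SingersByLastName ON Singers(LastName)",
])
operation.result(300)  # block until the schema change completes
```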
Adding the index after loading the data means that each write is efficient. However, the schema change for each index backfill can take a long time. The database is not fully usable, and queries can't use the indexes, until all of the schema changes are complete. The database can still serve writes and queries, but at a slower rate.
We recommend adding indexes that are critical to your business application before you load the data. Add all non-critical indexes after the data is migrated.
Use INTERLEAVE IN during bulk load
For schemas with many parent-child references across tables, you must always load the parent before the children to ensure referential integrity. This orchestration can be tricky, especially with multiple levels of hierarchy. This complexity also makes batching and parallelizing much more difficult and can greatly increase the overall bulk load time. In Spanner, these relationships are enforced using INTERLEAVE IN PARENT or foreign keys. See the CREATE TABLE documentation for more details.
Adding a foreign key after bulk load creates an index under the covers, so follow the guidelines in Guidelines for secondary indexes.
However, for INTERLEAVE IN PARENT tables, we recommend that you create all tables using INTERLEAVE IN semantics during the bulk load. INTERLEAVE IN physically interleaves rows but does not enforce referential integrity. This gives you the performance benefits of locality without requiring the up-front ordering. Because child rows can be inserted before the corresponding parent, Spanner can write to all tables in parallel.
Once all tables are loaded, you can migrate the interleaved tables to start enforcing the parent-child relationship using the ALTER TABLE t1 SET INTERLEAVE IN PARENT t2 statement. This validates referential integrity, and it fails if there are any orphaned child rows. If validation fails, identify missing parent rows using the query shown after the sketch below.
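Here is a minimal Python sketch of this flow, assuming the GoogleSQL dialect and hypothetical Singers (parent) and Albums (child) tables; see the CREATE TABLE documentation referenced above for the exact INTERLEAVE IN syntax.

```python
# A minimal sketch, assuming the GoogleSQL dialect and hypothetical Singers
# (parent) and Albums (child) tables.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("test-instance").database("example-db")

# Before the bulk load: interleave physically, without enforcement.
database.update_ddl(["""
    CREATE TABLE Albums (
      SingerId INT64 NOT NULL,
      AlbumId  INT64 NOT NULL,
      Title    STRING(MAX)
    ) PRIMARY KEY (SingerId, AlbumId),
      INTERLEAVE IN Singers
"""]).result(300)

# ... bulk load Singers and Albums in parallel ...

# After the load: validate and enforce the parent-child relationship.
database.update_ddl([
    "ALTER TABLE Albums SET INTERLEAVE IN PARENT Singers",
]).result(300)
```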
```sql
SELECT pk1, pk2 FROM child
EXCEPT DISTINCT
SELECT pk1, pk2 FROM parent;
```
Test and measure the throughput
Predicting throughput can be difficult. We recommend that you test your bulk loading strategy before running the final load. For a detailed example using partitioning and monitoring performance, see Maximizing data load throughput.
Best practices for periodic bulk loading to an existing database
If you are updating an existing database that contains data but does not have any secondary indexes, then the recommendations in this document still apply.
If you do have secondary indexes, the instructions might yield reasonable performance. Performance depends on how many splits, on average, are involved in your transactions. If throughput drops too low, you can try the following:
- Include a smaller number of mutations in each commit, which might increase throughput.
- If your upload is larger than the total current size of the table being updated, delete your secondary indexes and then add them again after you upload the data. This step is usually not necessary, but it might improve the throughput (see the sketch after this list).
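Here is a minimal sketch of that drop-and-recreate step with the Python client, assuming a hypothetical SingersByLastName index.

```python
# A minimal sketch of dropping and re-adding an index around a large upload,
# assuming the Python client and a hypothetical SingersByLastName index.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("test-instance").database("example-db")

# Drop the index so the upload doesn't pay the per-write index maintenance.
database.update_ddl(["DROP INDEX SingersByLastName"]).result(300)

# ... run the bulk upload ...

# Recreate the index; Spanner backfills it once, after the data is in place.
database.update_ddl([
    "CREATE INDEX SingersByLastName ON Singers(LastName)",
]).result(300)
```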