Speeding Up Bulk Loading in PostgreSQL
Testing 4 ways to bulk load data into PostgreSQL
The Need For Speed
If you only need to load a few hundred records into your database, you probably aren't too concerned about efficiency. But what happens when you try to insert thousands, or even millions, of records? Now, data-loading efficiency can mean the difference between success and failure for your project, or at the very least the difference between a project that's delivered on time and one that's woefully overdue.
PostgreSQL has a great COPY command that's optimized for this task: https://www.postgresql.org/docs/current/sql-copy.html. That's only a good option, though, if your data is already in a CSV (or binary) file. What if you need to load data from plain SQL? Then what's the fastest way?
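For reference, a COPY load looks something like the sketch below. It assumes the users table created in the examples that follow, and the CSV path is hypothetical:

-- Server-side COPY: the file path is read by the PostgreSQL server process,
-- so it needs to exist on the server and be readable by it.
copy users (id, firstname, lastname)
from '/tmp/users.csv'
with (format csv, header true);

-- From psql, \copy reads the file on the client side instead:
-- \copy users (id, firstname, lastname) from 'users.csv' with (format csv, header true)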
Four Ways to Insert Data
Basic Insert Commands
Let’s look at the structure for some basic SQL insert commands:
create table users (id integer, firstname text, lastname text);

insert into users (id, firstname, lastname) values (1, 'George', 'Washington');
insert into users (id, firstname, lastname) values (2, 'John', 'Adams');
insert into users (id, firstname, lastname) values (3, 'Thomas', 'Jefferson');
Now we have some basic SQL for inserting records into our users table. This will get the data in, all right, but it's the slowest way to do it. Let's look at some ways we can speed things up.
Transactions
A quick and easy way to speed things up is simply to put large batches of insert statements inside a transaction:
begin transaction;
insert into users (id, firstname, lastname) values (1, 'George', 'Washington');
insert into users (id, firstname, lastname) values (2, 'John', 'Adams');
insert into users (id, firstname, lastname) values (3, 'Thomas', 'Jefferson');
commit;
In my Windows test, this doubled the speed of inserting 100k user records. Under macOS, the speed tripled. While you can technically create batches with billions of records in them, you'll probably want to experiment with batch sizes of, say, 1,000, 10,000, or 100,000 records to see what works best for your hardware, bandwidth, and record size.
Batched Inserts
Another way to speed things up is to use the SQL batch insert syntax when doing your insert. For example:
insert into users (id, firstname, lastname) values
    (1, 'George', 'Washington'),
    (2, 'John', 'Adams'),
    (3, 'Thomas', 'Jefferson');
This method speeds things up considerably. In my tests, it was about six times faster. The same rules apply to batch sizes as with transactions: you'll want to test different sizes to find what's optimal. I generally start with a batch of around 10,000 records for most applications, and if that works well enough, I leave it there.
What About Both?
Can you combine transactions and batched insert statements for even more speed? Well, yes and no. You certainly can combine them, but the speed increase is negligible (in my Windows test it even slowed things down slightly).
begin transaction;
insert into users (id, firstname, lastname) values
    (1, 'George', 'Washington'),
    (2, 'John', 'Adams'),
    (3, 'Thomas', 'Jefferson');
commit;
So, while using both techniques here is perfectly valid, it may not be the fastest way to load data.
The Downsides
What are the potential downsides of using transactions or batched inserts? Error handling is the main one. If any one of the records in your batch fails, the entire batch will fail and no data will be inserted into your table from that batch. So you'll need to make sure your data is valid or else have some way to break up and fix failed batches.
If the failure is caused by a unique constraint, you can use the on conflict clause in your insert statement, as shown below, but if your insert fails for any other reason, it'll throw out the whole batch.
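Here's a sketch of a batched insert that skips duplicate rows. It assumes users.id carries a unique or primary key constraint, which the sample table above doesn't actually declare, so you'd need to add one for this to apply:

-- Skip rows whose id already exists instead of failing the whole batch.
-- Assumes: alter table users add primary key (id);
insert into users (id, firstname, lastname) values
    (1, 'George', 'Washington'),
    (2, 'John', 'Adams'),
    (3, 'Thomas', 'Jefferson')
on conflict (id) do nothing;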
Other Speed Considerations
There are many other factors that can affect your data insert speed and ways you can make things even faster. Removing indexes until after inserting data, creating unlogged tables, and avoiding unnecessary unique keys are just a few of these. These other optimizations will improve performance, but probably not nearly as dramatically as the basic techniques described here.
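As a rough sketch of what a couple of these look like in practice (the index name here is made up for illustration):

-- An unlogged table skips WAL writes, which speeds up loading,
-- but its contents are not crash-safe and are not replicated.
create unlogged table users_staging (id integer, firstname text, lastname text);

-- Drop a hypothetical index before the bulk load, then rebuild it afterwards,
-- so PostgreSQL doesn't maintain it row by row during the load.
drop index if exists users_lastname_idx;
-- ... run the bulk inserts here ...
create index users_lastname_idx on users (lastname);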
Conclusion
If you need to deal with large amounts of data, it pays to plan ahead when you're writing your SQL insert code. A few small changes can potentially save you hours (or sometimes even days) of processing time.
Appendix: Sample Test Results
See my GitHub repo postgresql-bulk-load-tests for some code to test these methods. My test run results are listed below.
===========================
Windows VM (UTM Windows 11)
===========================
create 100k users with individual insert statements
30.0 seconds
create 100k users with individual insert statements in a transaction
14.0 seconds
create 100k users with batch insert statement
4.3 seconds
create 100k users with batch insert statement in a transaction
4.6 seconds

====================
MacBook Pro (M1 Max)
====================
create 100k users with individual insert statements
real 0m9.112s
user 0m0.509s
sys  0m0.337s
create 100k users with individual insert statements in a transaction
real 0m2.540s
user 0m0.457s
sys  0m0.325s
create 100k users with batch insert statement
real 0m1.360s
user 0m0.179s
sys  0m0.042s
create 100k users with batch insert statement in a transaction
real 0m1.154s
user 0m0.189s
sys  0m0.041s