My projects

View more →

Recent guides

View all →

PostgreSQL Preloading

Fetching complex data while avoiding large numbers of queries

PostedAugust 07, 2017 (UpdatedAugust 09, 2018) by leafo (@moonscript) · Tags: postgresql

Avoiding n+1 queries

More preloading strategies

Selecting a subset of columns

Getting a subset of a one to many relationship

Subquery androw_to_json

Querying hierarchical tables with recursive queries

Avoiding n+1 queries

If you're using anORM, or any kind of abstraction over your database,you've likely run into then+1 selects issue. This is when a property accessin your programming language triggers a query to fetch an associated objectthrough a relation.When looping over a set of objects, a new query istriggered each time the field is accessed. This results in 1 query to fetchthe initial set of items, thenn additional queries for each association.Hence the name:n+1.

n+1 queries should be avoided because the round-trip latency of a query canadd up, even if the queries themselves are fast.

Many web frameworks have ways to mitigate this using preloading (or eagerloading). The idea is that you load all of your data in as few queries aspossible before looping over it.

There are two common approaches to solving this:

Use ajoin
Use a second query

Using a`join`

A join enables a query to include data from another relation.

Single query — A join can pull data from many different tables all in a single query
It’s necessary to manage the column names — the columns of the related tables must be renamed to avoid conflicts with the original table. After fetching the data you must then map everything back to how you expect it.
Duplicate data — If many of your results are related to the same object, then you'll duplicate that objects data many times in your result
All tables used must be on same database

Using a second query

In this approach you take the ids you recieved from the initial query, and useanother query using thewhere id in (a,b,c...) syntax.

Multiple queries — Each relation you want to include requires a new query. Instead ofn+1 queries, there arek queries, wherek is the number of relations. (Generally much smaller than the number of rows)
No duplication
General purpose — You can preload objects even if you've recieved a list of ids form someplace other than the database where you're querying from

More preloading strategies

The two approaches above will get you through the majority of preloading you'llneed to do. In the rest of this post I'll cover some techniques I've used forhandling more complex requirements.

Composite types

When selecting fields for a result PostgreSQL expects a single value where youmight want to return multiple values. The most common example is somethingretrurned from a select query:

selectafromSomeTable

If you want to replacea with a subquery that returns multiple rows you'llsee this error:

ERROR:  more than one row returned by a subquery used as an expression

We can use acomposite type in order to return many values in the place ofone. The most primitive composite type is theresult type created by therow constructor (or with plain()):

selectid,row(1,now(),'hello',NULL,name)fromSomeTable

Theresult type is convenient for when queries need to pass composite valuesto one another, but not ideal when returning data to your host language.

When implementing my own PostgreSQL client,pgmoon, I realized thatthe server did not return all the sub-types when returning aresult value.The client would have to parse a string to get the contained values. For thisreason I decided to focus on thejson composite type instead, since the JSONformat is well defined and parsers are fast and ubiquitous. If the PostgreSQLclient you're using doesn’t already parse the JSON result, then it would betrivial to handle it in your language of choice.

You can read more aboutcomposite types in the PostgreSQLdocumentation.

Using JSON as a join alternative

PostgreSQL 9.2 added therow_to_json function, which lets youtransform a row to a JSON object.

Modern PostgreSQL clients are aware of the JSON type and will automaticallydeserialize them when loading the results of the query into your host language.

Consider the following tables:Posts andUsers. Each post has auser_idcolumn that acts as a foreign key to the users table. Given a query to select acollection of posts, we'd like to also preload the users.

You could write the query like this:

select*,(selectrow_to_json(Users)fromUserswhereUsers.id=user_id)asuserfrompostsorderbyiddesclimit20

This will get all the columns from thePosts table, along with all thecolumns from theUsers table namespaced inside the JSON object calleduser:

id, user_id, title, user1, 12, "My first post", {"id": 12, "username": "leafo"}1, 14, "Another thing", {"id": 14, "username": "lee"}

Selecting a subset of columns

If you need to select only a few columns from the associated object then you'llneed to modify the query a bit. The easiest way is to use an additionalsubquery to create a new anonymous row type with names on each field:

select*,(selectrow_to_json(t)from(selectuser.name,user.idfromUserswhereUsers.id=user_id)ast)frompostsorderbyiddesclimit20

You might be tempted to use a result type, but there is no way to give namesto the columns so you'll be stuck with autogenerated names that aren’tdescriptive

Getting a subset of a one to many relationship

Consider the following tables:Collections andCollectionItems. Eachcollection item has a column,collection_id, which serves as a foreign keypointing back to the collection it’s in, along with aadded_at column holdingthe date when the item was added. Each collection has many collection items init. Sometimes there can be thousands of items.

You're coding a page that shows a summary of collections in your database. Onthis page you'd like to show the 10 most recent collection items for eachcollection.

Getting the list of collections we want to show is trivial:

select*fromCollectionsorderbyiddesclimit20

For each collection, getting the list of items is also a simple query:

select*fromCollectionItemswherecollection_id=?orderbyadded_atdesclimit10

Sadly this suffers from the n+1 query problem if we try to fetch the set ofitems for each collection in our host application.

One technique I've seen to solve this is aunion. You generate the SQL forselecting the subset of items for each collection, then contatenate them alltogether with aunion clause. Although you're still written n queries, you'resending them all together in a single statement, so you avoid the overhead thatmakes n+1 queries slow.

Here’s what that query might look like for a few collections:

(select*fromCollectionItemswherecollection_id=1orderbyadded_atdesclimit10)union(select*fromCollectionItemswherecollection_id=2orderbyadded_atdesclimit10)union(select*fromCollectionItemswherecollection_id=3orderbyadded_atdesclimit10)

Parenthesis are used for each query, otherwise thelimit clause will beapplied to the union of all the rows.

With a lot of collections this query becomes quite unwieldy. We can write acleaner query using PostgreSQL arrays along with a subquery.

Because we can’t use aselect statement to return multiple rows, we'll usethearray compsite type to bundle the ids together:

select*,array(selectidfromCollectionItemswherecollection_id=collections.idorderbyadded_atdesclimit10)fromCollectionsorderbyiddesclimit20

With this query you'll get back an array of foreign keys for each row returnedfrom the database. Using the two query approach you can select all thecollection items needed by id, then map them back to the collection objectswith a hash table.

select*fromCollectionItemswhereidin(1,2,3,4,...)

Your PostgreSQL library should be able to deserialize an array type. Ifnot, you might get back a string representation of the array which you'llneed to parse. Alternatively, you can use thejson type instead.

Subquery and`row_to_json`

We can reduce the above example to a single query by combining arrays androw_to_json. The final query looks something like this:

select*,array(selectrow_to_json(CollectionItems)fromCollectionItemswherecollection_id=collections.idorderbyadded_atdesclimit10)fromCollectionsorderbyiddesclimit20

There are two great things about this query: it’s easy to read and it’s easy toprocess back in the application layer. We've solved a problem that’s typicallyannoying in a simple way.

The only change from the previous example rewritingid asrow_to_json(CollectionItems). In any query you can use the table name as analias to the current row and therow_to_json function takes a row as itsargument.

Querying hierarchical tables with recursive queries

Mapping a tree-like hierarchy in a relational database is a common problem. Theeasiest schema is to use aparent_id foreign key that points to a row in thesame table.

This schema is problematic when querying an entire hierarchy. For each row youfetch you'll get ids you need to fetch the next level in the hierarchy, givingyou n queries where n is the depth of the tree.

Because of this there are a fewclever schema tricks you can doit enable more efficient queries. Sadly they make your schema more complicated,and require managing more columns when making updates to the hierarchy.

Using PostgreSQL’s recursive queries syntax we can use the simpleparent_idschema while still having efficient queries.

The are two common queries we'll want to do on hierarchical data:

Given a row,find all the ancestors in a single query. Additionally, we might want to preload the ancestors for many categories in a single query
Given a row,find all the children in a single query. Additionally, we should be able to limit by either depth, or number of children returned for a single depth

Finding ancestors

Here’s a simple schema that you would use to represent a hierarchical structure:

Categories--------idparent_category_idname

The data in this table might look like this

id, parent_category_id, name1, NULL, "Top"2, 1, "Child 1"3, 1, "Child 2"4, 2, "Deep child"

If we have a reference to category id4, a query to access the ancestorsshould return:

id, parent_category_id, name4, 2, "Deep child"2, 1, "Child 1"1, NULL, "Top"

We can do this using a recursive query like this:

withrecursivenestedas((select*fromCategorieswhereid=4)unionselectpr.*fromCategoriespr,nestedwherepr.id=nested.parent_category_id)select*fromnested

Recursive queries and comprised of a base case and a recursive case.

Our base case returns the row identified by id4, our starting point.

select*fromCategorieswhereid=4

The base case isunioned with the recursive case. In this case, the recursivecase operates on an existing row fetched by the query aliased bynested.

selectpr.*fromCategoriespr,nestedwherepr.id=nested.parent_category_id

Every time a new query is added to the results, it will be used against therecursive query to see if any new rows should be added.

Finding ancestors for many objects at once

To preload the ancestors of many categories at once we can make a slightmodification to the query:

withrecursivenestedas((select*fromCategorieswhereidin(4,3))unionselectpr.*fromCategoriespr,nestedwherepr.id=nested.parent_category_id)select*fromnested

All we've done is update the base case to start with multiple categories usingthein syntax.

The resulting rows will be returned breadth first. You'll need to use some codein your application to organize all the returned values into arrays for eachstarting category.

Finding children

Finding child nodes in a hierarchical structure works very similarly. We'll usethe same schema from above, but we'll introduce a position field that controlsthe ordering of the children in the parent.

Categories--------idparent_category_idpositionname

If we wanted to select all the nested children from the0 id category we canwrite a query like this:

withrecursivenestedas(select*fromCategorieswhereid=0unionselectpr.*fromCategoriespr,nestedwherepr.parent_category_id=nested.id)select*fromnested

PostgreSQL will not allow us to add anorder orlimit clause in therecursive query, giving us the errors:

ERROR:  ORDER BY in a recursive query is not implementedERROR:  LIMIT in a recursive query is not implemented

This is not an issue in practice since it would be your applications job torebuild the nested structure after fetching all the rows, and that couldinclude sorting what’s returned. The real advantage of having a position field(ordered naturally) is to limit the number of rows returned. If we wanted topreview the first N categories the we can filter by the position field:

withrecursivenestedas(select*fromCategorieswhereid=0unionselectpr.*fromCategoriespr,nestedwherepr.parent_category_id=nested.idandpr.position<10)select*fromnested

Indexes

The same index rules apply to recursive queries as regular queries. If you wantyour recursive query to be efficient then you'll need an appropriate index.Although recursive queries may trigger many queries internally in the database,they're fast because they don’t impose any round trip overhead for eachrecursive step.

In the ancestors example, no additional indexes are needed other than theprimary keyid. Accessing a field onnested does not need an index sincethat’s the current row, but fields referenced inpr will need an index.

In the children example, an index onparent_category_id is needed:

createindexonCategories(parent_category_id)

In the example with fetching children with a limit, we should have a compositeindex onparent_category_id andposition. Column order is important forthis index since we're using the less than range operator,position needs tobe last.

createindexonCategories(parent_category_id,position)

With the composite index, a separate index onparent_category_id is redundant.

Here are some more guides tagged 'postgresql'

Calculating Percentile (and Median) in PostgreSQL

PostedAugust 09, 2018

Using PostgreSQL with OpenResty

PostedJuly 04, 2015

Movatterモバイル変換

My projects

Recent guides

Recent posts

PostgreSQL Preloading

Fetching complex data while avoiding large numbers of queries

Avoiding n+1 queries

Using a`join`

Using a second query

More preloading strategies

Composite types

Using JSON as a join alternative

Selecting a subset of columns

Getting a subset of a one to many relationship

Subquery and`row_to_json`

Querying hierarchical tables with recursive queries

Finding ancestors

Finding ancestors for many objects at once

Finding children

Indexes

Movatterモバイル変換

My projects

Recent guides

Recent posts

PostgreSQL Preloading

Fetching complex data while avoiding large numbers of queries

Avoiding n+1 queries

Using ajoin

Using a second query

More preloading strategies

Composite types

Using JSON as a join alternative

Selecting a subset of columns

Getting a subset of a one to many relationship

Subquery androw_to_json

Querying hierarchical tables with recursive queries

Finding ancestors

Finding ancestors for many objects at once

Finding children

Indexes

Using a`join`

Subquery and`row_to_json`