If you're using anORM, or any kind of abstraction over your database,you've likely run into then+1 selects issue. This is when a property accessin your programming language triggers a query to fetch an associated objectthrough a relation.When looping over a set of objects, a new query istriggered each time the field is accessed. This results in 1 query to fetchthe initial set of items, thenn additional queries for each association.Hence the name:n+1.
n+1 queries should be avoided because the round-trip latency of a query canadd up, even if the queries themselves are fast.
Many web frameworks have ways to mitigate this using preloading (or eagerloading). The idea is that you load all of your data in as few queries aspossible before looping over it.
There are two common approaches to solving this:
joinjoinA join enables a query to include data from another relation.
In this approach you take the ids you recieved from the initial query, and useanother query using thewhere id in (a,b,c...) syntax.
n+1 queries, there arek queries, wherek is the number of relations. (Generally much smaller than the number of rows)The two approaches above will get you through the majority of preloading you'llneed to do. In the rest of this post I'll cover some techniques I've used forhandling more complex requirements.
When selecting fields for a result PostgreSQL expects a single value where youmight want to return multiple values. The most common example is somethingretrurned from a select query:
selectafromSomeTableIf you want to replacea with a subquery that returns multiple rows you'llsee this error:
ERROR: more than one row returned by a subquery used as an expressionWe can use acomposite type in order to return many values in the place ofone. The most primitive composite type is theresult type created by therow constructor (or with plain()):
selectid,row(1,now(),'hello',NULL,name)fromSomeTableTheresult type is convenient for when queries need to pass composite valuesto one another, but not ideal when returning data to your host language.
When implementing my own PostgreSQL client,pgmoon, I realized thatthe server did not return all the sub-types when returning aresult value.The client would have to parse a string to get the contained values. For thisreason I decided to focus on thejson composite type instead, since the JSONformat is well defined and parsers are fast and ubiquitous. If the PostgreSQLclient you're using doesn’t already parse the JSON result, then it would betrivial to handle it in your language of choice.
You can read more aboutcomposite types in the PostgreSQLdocumentation.
PostgreSQL 9.2 added therow_to_json function, which lets youtransform a row to a JSON object.
Modern PostgreSQL clients are aware of the JSON type and will automaticallydeserialize them when loading the results of the query into your host language.
Consider the following tables:Posts andUsers. Each post has auser_idcolumn that acts as a foreign key to the users table. Given a query to select acollection of posts, we'd like to also preload the users.
You could write the query like this:
select*,(selectrow_to_json(Users)fromUserswhereUsers.id=user_id)asuserfrompostsorderbyiddesclimit20This will get all the columns from thePosts table, along with all thecolumns from theUsers table namespaced inside the JSON object calleduser:
id, user_id, title, user1, 12, "My first post", {"id": 12, "username": "leafo"}1, 14, "Another thing", {"id": 14, "username": "lee"}If you need to select only a few columns from the associated object then you'llneed to modify the query a bit. The easiest way is to use an additionalsubquery to create a new anonymous row type with names on each field:
select*,(selectrow_to_json(t)from(selectuser.name,user.idfromUserswhereUsers.id=user_id)ast)frompostsorderbyiddesclimit20You might be tempted to use a result type, but there is no way to give namesto the columns so you'll be stuck with autogenerated names that aren’tdescriptive
Consider the following tables:Collections andCollectionItems. Eachcollection item has a column,collection_id, which serves as a foreign keypointing back to the collection it’s in, along with aadded_at column holdingthe date when the item was added. Each collection has many collection items init. Sometimes there can be thousands of items.
You're coding a page that shows a summary of collections in your database. Onthis page you'd like to show the 10 most recent collection items for eachcollection.
Getting the list of collections we want to show is trivial:
select*fromCollectionsorderbyiddesclimit20For each collection, getting the list of items is also a simple query:
select*fromCollectionItemswherecollection_id=?orderbyadded_atdesclimit10Sadly this suffers from the n+1 query problem if we try to fetch the set ofitems for each collection in our host application.
One technique I've seen to solve this is aunion. You generate the SQL forselecting the subset of items for each collection, then contatenate them alltogether with aunion clause. Although you're still written n queries, you'resending them all together in a single statement, so you avoid the overhead thatmakes n+1 queries slow.
Here’s what that query might look like for a few collections:
(select*fromCollectionItemswherecollection_id=1orderbyadded_atdesclimit10)union(select*fromCollectionItemswherecollection_id=2orderbyadded_atdesclimit10)union(select*fromCollectionItemswherecollection_id=3orderbyadded_atdesclimit10)Parenthesis are used for each query, otherwise the
limitclause will beapplied to the union of all the rows.
With a lot of collections this query becomes quite unwieldy. We can write acleaner query using PostgreSQL arrays along with a subquery.
Because we can’t use aselect statement to return multiple rows, we'll usethearray compsite type to bundle the ids together:
select*,array(selectidfromCollectionItemswherecollection_id=collections.idorderbyadded_atdesclimit10)fromCollectionsorderbyiddesclimit20With this query you'll get back an array of foreign keys for each row returnedfrom the database. Using the two query approach you can select all thecollection items needed by id, then map them back to the collection objectswith a hash table.
select*fromCollectionItemswhereidin(1,2,3,4,...)Your PostgreSQL library should be able to deserialize an array type. Ifnot, you might get back a string representation of the array which you'llneed to parse. Alternatively, you can use the
jsontype instead.
row_to_jsonWe can reduce the above example to a single query by combining arrays androw_to_json. The final query looks something like this:
select*,array(selectrow_to_json(CollectionItems)fromCollectionItemswherecollection_id=collections.idorderbyadded_atdesclimit10)fromCollectionsorderbyiddesclimit20There are two great things about this query: it’s easy to read and it’s easy toprocess back in the application layer. We've solved a problem that’s typicallyannoying in a simple way.
The only change from the previous example rewritingid asrow_to_json(CollectionItems). In any query you can use the table name as analias to the current row and therow_to_json function takes a row as itsargument.
Mapping a tree-like hierarchy in a relational database is a common problem. Theeasiest schema is to use aparent_id foreign key that points to a row in thesame table.
This schema is problematic when querying an entire hierarchy. For each row youfetch you'll get ids you need to fetch the next level in the hierarchy, givingyou n queries where n is the depth of the tree.
Because of this there are a fewclever schema tricks you can doit enable more efficient queries. Sadly they make your schema more complicated,and require managing more columns when making updates to the hierarchy.
Using PostgreSQL’s recursive queries syntax we can use the simpleparent_idschema while still having efficient queries.
The are two common queries we'll want to do on hierarchical data:
Here’s a simple schema that you would use to represent a hierarchical structure:
Categories--------idparent_category_idnameThe data in this table might look like this
id, parent_category_id, name1, NULL, "Top"2, 1, "Child 1"3, 1, "Child 2"4, 2, "Deep child"If we have a reference to category id4, a query to access the ancestorsshould return:
id, parent_category_id, name4, 2, "Deep child"2, 1, "Child 1"1, NULL, "Top"We can do this using a recursive query like this:
withrecursivenestedas((select*fromCategorieswhereid=4)unionselectpr.*fromCategoriespr,nestedwherepr.id=nested.parent_category_id)select*fromnestedRecursive queries and comprised of a base case and a recursive case.
Our base case returns the row identified by id4, our starting point.
select*fromCategorieswhereid=4The base case isunioned with the recursive case. In this case, the recursivecase operates on an existing row fetched by the query aliased bynested.
selectpr.*fromCategoriespr,nestedwherepr.id=nested.parent_category_idEvery time a new query is added to the results, it will be used against therecursive query to see if any new rows should be added.
To preload the ancestors of many categories at once we can make a slightmodification to the query:
withrecursivenestedas((select*fromCategorieswhereidin(4,3))unionselectpr.*fromCategoriespr,nestedwherepr.id=nested.parent_category_id)select*fromnestedAll we've done is update the base case to start with multiple categories usingthein syntax.
The resulting rows will be returned breadth first. You'll need to use some codein your application to organize all the returned values into arrays for eachstarting category.
Finding child nodes in a hierarchical structure works very similarly. We'll usethe same schema from above, but we'll introduce a position field that controlsthe ordering of the children in the parent.
Categories--------idparent_category_idpositionnameIf we wanted to select all the nested children from the0 id category we canwrite a query like this:
withrecursivenestedas(select*fromCategorieswhereid=0unionselectpr.*fromCategoriespr,nestedwherepr.parent_category_id=nested.id)select*fromnestedPostgreSQL will not allow us to add anorder orlimit clause in therecursive query, giving us the errors:
ERROR: ORDER BY in a recursive query is not implementedERROR: LIMIT in a recursive query is not implementedThis is not an issue in practice since it would be your applications job torebuild the nested structure after fetching all the rows, and that couldinclude sorting what’s returned. The real advantage of having a position field(ordered naturally) is to limit the number of rows returned. If we wanted topreview the first N categories the we can filter by the position field:
withrecursivenestedas(select*fromCategorieswhereid=0unionselectpr.*fromCategoriespr,nestedwherepr.parent_category_id=nested.idandpr.position<10)select*fromnestedThe same index rules apply to recursive queries as regular queries. If you wantyour recursive query to be efficient then you'll need an appropriate index.Although recursive queries may trigger many queries internally in the database,they're fast because they don’t impose any round trip overhead for eachrecursive step.
In the ancestors example, no additional indexes are needed other than theprimary keyid. Accessing a field onnested does not need an index sincethat’s the current row, but fields referenced inpr will need an index.
In the children example, an index onparent_category_id is needed:
createindexonCategories(parent_category_id)In the example with fetching children with a limit, we should have a compositeindex onparent_category_id andposition. Column order is important forthis index since we're using the less than range operator,position needs tobe last.
createindexonCategories(parent_category_id,position)With the composite index, a separate index on
parent_category_idis redundant.
leafo.net · Generated Sun Oct 8 13:02:35 2023 bySitegenmastodon.social/@leafo