Posted onDec 25, 2024

Trees in SQL

#sql #database #tutorial #computerscience

Business applications often need to represent and store information such as a company's organizational hierarchy, a country's administrative divisions, an inventory structure, or simply a set of folders and files. Despite their apparent differences, these examples share a common feature— they are all trees. Not the green, wood-made kind that provide pleasant shade on a sunny day, but, mathematically speaking, connected acyclic graphs. In computer science, the links between nodes are described in terms of parent-child relationships. Each node in the tree, except the root, has exactly one parent and zero, one, or many children.

The figure above shows an example of a tree. Vertex A is called the root of the tree. Nodes without subtrees — C, D, F, G, and H — are called leaves. Edges of the tree show parent-child relationships between nodes. A rooted tree is k-ary if each node has no more than k children. A binary tree is a well-known particular case of a k-ary tree where k = 2.

One of the common tasks performed on tree data is the retrieval of all child nodes at any nesting level, given a parent node. For example, calculating the total turnover of an account involves determining the turnovers of all its subaccounts, or compiling a list of goods belonging to a given goods group includes lists from all the nested subgroups.

In imperative programming languages, we usually use a graph traversal algorithm to find all subnodes. Although most modern database servers support SQL queries with recursive common table expressions, using the recursive part inside complex queries on large amounts of data can significantly degrade performance.

The reason is that SQL deals primarily with sets, not graphs, so one should implement dedicated data structures to represent nodes and edges as nested sets for efficient data selection.

In this article, we will consider four ways of storing tree-like structures inside a relational database and evaluate them in terms of query performance and the space needed for storing the underlying data.

Link to the parent

Perhaps the most intuitive and easiest way to represent tree-like data is to use the adjacency matrix in a relational table. Every record in the table corresponds to a node in the tree and holds the unique ID of the node (primary key) and the ID of its parent (for root nodes, the parent is null).

Below is a DDL query that creates thetest1 table:

CREATETABLEtest1(idINTEGERNOTNULL,parentINTEGER,nameVARCHAR(60)NOTNULL,PRIMARYKEY(id),FOREIGNKEY(parent)REFERENCEStest1(id)ONUPDATECASCADEONDELETECASCADE)

Pay attention to theON DELETE CASCADE clause of theFOREIGN KEY constraint declared on thePARENT field. It guarantees that the entire subtree will be deleted at once if the parent node is deleted. To protect the subtree from accidental deletion, the rule should be changed fromCASCADE toRESTRICT.

Corresponding data for the tree shown at the beginning of the article looks like:

id	parent	name
1	NULL	A
2	1	B
3	1	E
4	1	H
5	2	C
6	2	D
7	3	F
8	3	G

Operations on tree data

Addition of a new tree node is done by theINSERT INTO operator:

INSERTINTOtest1(id,parent)VALUES(<uniqueidofthenode>,<idoftheparentnodeorNULLforarootnode>)

To move the node from one parent to another, a value of thePARENT field has to be changed:

UPDATEtest1SETparent=<idofthenewparent>WHEREid=<idofthenodebeingmoved>

Obviously, to delete the node, one should invokeDELETE FROM operator providing ID of the node.

DELETEFROMtest1WHEREid=<idofthenode>

Selecting whole subtree of the given node

With such an organization of the data, a simpleSELECT operator allows us to fetch only one level of nesting of node's descendants.

SELECTidFROMtest1WHEREparent=:P

So, what should we do if the task is to extract all subnodes, regardless of how deep they lie in the subtree?

An unconventional solution would be to write a query for every possible level of nesting and then merge their outputs using theUNION operator. Fortunately, we can use either recursive stored procedures or recursive common table expressions to assist us.

The procedure below returns the IDs of all subnodes of a given node. The second parameter specifies whether the parent node will be included in the resulting set.

CREATEPROCEDURERecTest1(AnIDINTEGER,SelfINTEGER)RETURNS(IDINTEGER)ASBEGINIF(:Self<>0)THENBEGINID=:AnID;SUSPEND;ENDFORSELECTidFROMtest1WHEREparent=:AnIDINTO:IDDOFORSELECTidFROMRecTest1(:ID,1)INTO:IDDOSUSPEND;END;

We can use the procedure on its own orJOIN its output with the tree data to select the required columns:

SELECTt.*FROMtest1tJOINRecTest1(:P,0)rONt.id=r.id

If a specific SQL server supports recursive common table expressions, we can write a query that selects all nested subnodes of the given node:

WITHRECURSIVEtreeAS(SELECTid,parentFROMtest1WHEREparent=<idofthegivennode>UNIONALLSELECTg.id,g.parentFROMtest1gJOINtreehONg.parent=h.id)SELECTgt.id,gt.parentFROMtreegt

Storing auxiliary links

In the previous example, we placed an id-parent link in the table with tree data. Since we only know about the relationship between a given node and its children, we need recursive queries to scan the tree in depth, which is not efficient when working with large amounts of data.

To overcome this limitation, an auxiliary table should be introduced. Such a table holds pairs of connected nodes along with the distance between them, where a distance of 1 indicates a parent-child relationship, and distances greater than 1 correspond to deeper levels of ancestry.

CREATETABLEtest2(idINTEGERNOTNULL,nameVARCHAR(60)NOTNULL,PRIMARYKEY(id));CREATETABLEtest2_link(id_fromINTEGERNOTNULL,id_toINTEGERNOTNULL,distanceINTEGERNOTNULL,PRIMARYKEY(id_from,id_to),FOREIGNKEY(id_from)REFERENCEStest2(id)ONUPDATECASCADEONDELETECASCADE,FOREIGNKEY(id_to)REFERENCEStest2(id)ONUPDATECASCADEONDELETECASCADE);

Here is data for the tree:

id_from	id_to	distance
B	A	1
C	B	1
D	B	1
C	A	2
D	A	2
E	A	1
F	E	1
G	E	1
F	A	2
G	A	2
H	A	1

Let's get back to our business. Selection of the subtree now done much easier. No recursion required.

SELECTt.*FROMtest2tJOINtest2_linklONl.id_from=t.idWHEREl.id_to=:P

Usingdistance field, we can arrange the resulting set by node's closeness to the given root node.

Manipulating Tree Data

While we have managed to avoid recursion when retrieving data, this has significantly complicated the processes of inserting and modifying records.

Adding a Node

To add a node, you need to perform two operations: insert into the tree nodes table and the links table.

For example, to add a child node with an identifier2 to a parent node with an identifier1, execute the following commands:

INSERTINTOtest2(id)VALUES(2);INSERTINTOtest2_link(id_from,id_to,distance)VALUES(2,1,1);

Creating Triggers

To automate the maintenance of tree relationships, it's advisable to create two triggers.

The first trigger fires after inserting a record into thetest2 table and adds a self-referential link with a distance of0 intotest2_link:

CREATETRIGGERtest2_aiFORtest2AFTERINSERTPOSITION32000ASBEGININSERTINTOtest2_link(id_from,id_to,distance)VALUES(NEW.id,NEW.id,0);END;

The second trigger fires after inserting a record into thetest2_link table.

Its purpose is to create links between the new node and its descendants (if adding a subtree) and all ancestors of the parent node:

CREATETRIGGERtest2_link_aiFORtest2_linkAFTERINSERTPOSITION32000ASDECLAREVARIABLEAnIDINTEGER;DECLAREVARIABLEADistanceINTEGER;BEGINIF(NEW.distance=1)THENBEGININSERTINTOtest2_link(id_from,id_to,distance)SELECTdown.id_from,up.id_to,down.distance+up.distance+1FROMtest2_linkup,test2_linkdownWHEREup.id_from=NEW.id_toANDdown.id_to=NEW.id_fromANDdown.distance+up.distance>0;ENDEND;

Moving a Node

To move a node (or subtree) from one parent to another, follow these steps:

Remove the link to the current parent, detaching the subtree:

DELETEFROMtest2_linkWHEREid_from=<IDofthenodeorsubtree>ANDid_to=<IDofthecurrentparent>ANDdistance=1;

Add the subtree to the new parent:

INSERTINTOtest2_link(id_from,id_to,distance)VALUES(<IDofthenodeorsubtree>,<IDofthenewparent>,1);

The previously created insert trigger will handle the new relationships automatically.

Create a trigger that, upon deleting a link with a distance of1, removes all links between the nodes of the subtree and the nodes along the path from the parent node to the tree root:

CREATETRIGGERtest2_link_bdFORtest2_linkBEFOREDELETEPOSITION32000ASDECLAREVARIABLEAnIDFromINTEGER;DECLAREVARIABLEAnIDToINTEGER;BEGINIF(OLD.distance=1)THENBEGINFORSELECTdown.id_from,up.id_toFROMtest2_linkdownJOINtest2_linkupONup.id_from=OLD.id_fromANDup.distance+down.distance>1ANDdown.id_to=OLD.id_fromINTO:AnIDFrom,:AnIDToDODELETEFROMtest2_linkWHEREid_from=:AnIDFromANDid_to=:AnIDTo;ENDEND;

Deleting a Node

Deleting a node doesn't require special handling, as foreign keys in thetest2_link table are configured for cascading deletions.

When a node is deleted from thetest2 table, all related records intest2_link are automatically removed:

DELETEFROMtest2WHEREid=<NodeID>;

Ensuring Data Integrity

To maintain logical data integrity and adherence to our model, create two additional triggers.

The first trigger prevents editing records in thetest2_link table:

CREATEEXCEPTIONtree_e_incorrect_operation'Incorrect operation';CREATETRIGGERtest2_link_auFORtest2_linkBEFOREUPDATEPOSITION32000ASBEGINEXCEPTIONtree_e_incorrect_operation;END;

The second trigger deletes a node if a record with a distance of0 is removed from the links table:

CREATETRIGGERtest2_link_adFORtest2_linkAFTERDELETEPOSITION32000ASBEGINIF(OLD.distance=0)THENBEGINDELETEFROMtest2WHEREid=OLD.id_from;ENDEND;

These measures will help maintain data integrity and ensure the correct tree structure within the database.

Composite Path

It's not necessary to use a separate table to store the list of nodes leading to the root of a tree. Instead, you can concatenate the ordered list of node identifiers into a string, using a chosen delimiter, and store it directly within the node records table. For example:

CREATETABLEtest3(idINTEGERNOTNULL,parentINTEGER,pathVARCHAR(60)NOTNULL,PRIMARYKEY(id),FOREIGNKEY(parent)REFERENCEStest3(id)ONUPDATECASCADEONDELETECASCADE);

To optimize retrieval operations using theSTARTING WITH operator, it's essential to create the following index to enhance performance:

CREATEINDEXtest3_x_pathONtest3(path);

If you choose a forward slash (/) as the delimiter, thetest3 table for the tree depicted in the figure above would be populated as follows:

id	parent	path
A	NULL	/
B	A	/A/
C	B	/A/B/
D	B	/A/B/
E	A	/A/
F	E	/A/E/
G	E	/A/E/
H	A	/A/

Retrieving Subtree Nodes

To retrieve all nodes in the subtree rooted at nodeP, execute the following query:

SELECT*FROMtest3WHEREpathSTARTINGWITH(SELECTpathFROMtest3WHEREid=:P);

If you wish to exclude the root node itself from the results, modify the query accordingly:

SELECT*FROMtest3WHEREpathSTARTINGWITH(SELECTpathFROMtest3WHEREid=:P)ANDid<>:P;

Manipulating Tree Data

Similar to the simplest method of organizing hierarchical data discussed earlier, adding, moving, or deleting a node can be accomplished using standard SQL commands:INSERT,UPDATE, orDELETE.

The automatic generation of the path string can be managed by implementing the following triggers:

CREATETRIGGERtest3_biFORtest3BEFOREINSERTPOSITION32000ASDECLAREVARIABLEPINTEGER;BEGINNEW.path='';P=NEW.parent;WHILE(NOT:PISNULL)DOBEGINNEW.path=CAST(:PASVARCHAR(60))||'/'||NEW.path;SELECTparentFROMtest3WHEREid=:PINTO:P;ENDNEW.path='/'||NEW.path;END;

CREATEEXCEPTIONtree_e_incorrect_operation2'Incorrect operation';CREATETRIGGERtest3_buFORtest3BEFOREUPDATEPOSITION32000ASDECLAREVARIABLEPINTEGER;DECLAREVARIABLETINTEGER;DECLAREVARIABLENPVARCHAR(60);DECLAREVARIABLEOPVARCHAR(60);BEGINIF((NEW.parent<>OLD.parent)OR(NEW.parentISNULLANDNOTOLD.parentISNULL)OR(NOTNEW.parentISNULLANDOLD.parentISNULL))THENBEGINNEW.path='';P=NEW.parent;WHILE(NOT:PISNULL)DOBEGINNEW.path=CAST(:PASVARCHAR(60))||'/'||NEW.path;SELECTparentFROMtest3WHEREid=:PINTO:P;ENDNEW.path='/'||NEW.path;/* Check for circular references */IF(POSITION(('/'||CAST(NEW.idASVARCHAR(60))||'/')INNEW.path)>0)THENEXCEPTIONtree_e_incorrect_operation2;NP=NEW.path||CAST(NEW.idASVARCHAR(60))||'/';OP=OLD.path||CAST(OLD.idASVARCHAR(60))||'/';T=LENGTH(OP)+1;UPDATEtest3SETpath=:NP||SUBSTRING(pathFROM:TFOR60)WHEREpathSTARTINGWITH:OP;ENDEND;

Storing the composite path from the tree root to the current node imposes a limitation on the maximum tree depth, approximately equal toL div (K + 1), whereL is the length of the string field used to store the path, andK is the average length of the string representation of the record key.

Go to thepart 2...