Disjoint-set/Union-find Forest | |
---|---|
Type | multiway tree |
Invented | 1964 |
Invented by | Bernard A. Galler and Michael J. Fischer |
In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that stores a collection of disjoint (non-overlapping) sets. Equivalently, it stores a partition of a set into disjoint subsets. It provides operations for adding new sets, merging sets (replacing them with their union), and finding a representative member of a set. The last operation makes it possible to determine efficiently whether any two elements belong to the same set or to different sets.
While there are several ways of implementing disjoint-set data structures, in practice they are often identified with a particular implementation known as a disjoint-set forest. This specialized type of forest performs union and find operations in near-constant amortized time. For a sequence of m addition, union, or find operations on a disjoint-set forest with n nodes, the total time required is O(mα(n)), where α(n) is the extremely slow-growing inverse Ackermann function. Although disjoint-set forests do not guarantee this time per operation, each operation rebalances the structure (via tree compression) so that subsequent operations become faster. As a result, disjoint-set forests are both asymptotically optimal and practically efficient.
Disjoint-set data structures play a key role in Kruskal's algorithm for finding the minimum spanning tree of a graph. The importance of minimum spanning trees means that disjoint-set data structures support a wide variety of algorithms. In addition, these data structures find applications in symbolic computation and in compilers, especially for register allocation problems.
Disjoint-set forests were first described by Bernard A. Galler and Michael J. Fischer in 1964.[2] In 1973, their time complexity was bounded to $O(\log^{*} n)$, the iterated logarithm of $n$, by Hopcroft and Ullman.[3] In 1975, Robert Tarjan was the first to prove the $O(m\alpha(n))$ (inverse Ackermann function) upper bound on the algorithm's time complexity.[4] He also proved it to be tight. In 1979, he showed that this was the lower bound for a certain class of algorithms, pointer algorithms, that include the Galler-Fischer structure.[5] In 1989, Fredman and Saks showed that $\Omega(\alpha(n))$ (amortized) words of $O(\log n)$ bits must be accessed by any disjoint-set data structure per operation,[6] thereby proving the optimality of the data structure in this model.
In 1991, Galil and Italiano published a survey of data structures for disjoint-sets.[7]
In 1994, Richard J. Anderson and Heather Woll described a parallelized version of Union–Find that never needs to block.[8]
In 2007, Sylvain Conchon and Jean-Christophe Filliâtre developed a semi-persistent version of the disjoint-set forest data structure and formalized its correctness using the proof assistant Coq.[9] "Semi-persistent" means that previous versions of the structure are efficiently retained, but accessing previous versions of the data structure invalidates later ones. Their fastest implementation achieves performance almost as efficient as the non-persistent algorithm. They do not perform a complexity analysis.
Variants of disjoint-set data structures with better performance on a restricted class of problems have also been considered. Gabow and Tarjan showed that if the possible unions are restricted in certain ways, then a truly linear time algorithm is possible.[10] In particular, linear time is achievable if a "union tree" is given a priori. This is a tree that includes all elements of the sets. Let p[v] denote the parent of v in this tree; the assumption is then that every union operation has the form union(v, p[v]) for some v.
In this and the following section we describe the most common implementation of the disjoint-set data structure, as a forest of parent pointer trees. This representation is known as Galler-Fischer trees.
Each node in a disjoint-set forest consists of a pointer and some auxiliary information, either a size or a rank (but not both). The pointers are used to make parent pointer trees, where each node that is not the root of a tree points to its parent. To distinguish root nodes from others, their parent pointers have invalid values, such as a circular reference to the node or a sentinel value. Each tree represents a set stored in the forest, with the members of the set being the nodes in the tree. Root nodes provide set representatives: Two nodes are in the same set if and only if the roots of the trees containing the nodes are equal.
Nodes in the forest can be stored in any way convenient to the application, but a common technique is to store them in an array. In this case, parents can be indicated by their array index. Every array entry requires Θ(log n) bits of storage for the parent pointer. A comparable or lesser amount of storage is required for the rest of the entry, so the number of bits required to store the forest is Θ(n log n). If an implementation uses fixed size nodes (thereby limiting the maximum size of the forest that can be stored), then the necessary storage is linear in n.
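As a rough illustration of the array layout described above, the following Python sketch stores each parent as an array index and marks a root by a self-reference. The names `parent`, `size`, `root_of`, and `same_set` are illustrative assumptions, not part of any particular library.

```python
# Hypothetical array layout: parent[i] is the index of node i's parent,
# and a root is marked by the circular reference parent[i] == i.
parent = [0, 0, 1, 3, 3]   # two trees: {0, 1, 2} rooted at 0 and {3, 4} rooted at 3
size = [3, 1, 1, 2, 1]     # auxiliary data; only the root entries (0 and 3) matter


def root_of(i):
    """Follow parent indices until reaching a self-referencing root."""
    while parent[i] != i:
        i = parent[i]
    return i


def same_set(a, b):
    """Two nodes are in the same set exactly when their roots are equal."""
    return root_of(a) == root_of(b)
```

With this layout, `same_set(1, 2)` holds because both nodes lead to root 0, while nodes 2 and 4 lead to different roots.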
Disjoint-set data structures support three operations: Making a new set containing a new element; Finding the representative of the set containing a given element; and Merging two sets.
The MakeSet operation adds a new element into a new set containing only the new element, and the new set is added to the data structure. If the data structure is instead viewed as a partition of a set, then the MakeSet operation enlarges the set by adding the new element, and it extends the existing partition by putting the new element into a new subset containing only the new element.
In a disjoint-set forest, MakeSet initializes the node's parent pointer and the node's size or rank. If a root is represented by a node that points to itself, then adding an element can be described using the following pseudocode:
function MakeSet(x) is
    if x is not already in the forest then
        x.parent := x
        x.size := 1     // if nodes store size
        x.rank := 0     // if nodes store rank
    end if
end function
This operation has constant time complexity. In particular, initializing a disjoint-set forest with n nodes requires O(n) time.
Lack of a parent assigned to the node implies that the node is not present in the forest.
In practice, MakeSet must be preceded by an operation that allocates memory to hold x. As long as memory allocation is an amortized constant-time operation, as it is for a good dynamic array implementation, it does not change the asymptotic performance of the disjoint-set forest.
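A minimal Python sketch of MakeSet follows, assuming that elements are hashable and that the forest is kept in two dictionaries. The names `parent`, `size`, and `make_set` are illustrative assumptions. Absence of a key plays the role of "not yet in the forest", and the dictionary insertion stands in for the memory allocation step.

```python
parent = {}   # element -> parent element; a root maps to itself
size = {}     # element -> subtree size; only meaningful for roots


def make_set(x):
    """Add x as a singleton set unless it is already in the forest."""
    if x not in parent:
        parent[x] = x   # a root points to itself
        size[x] = 1     # assumes nodes store size rather than rank


for element in ["a", "b", "c", "a"]:   # the repeated "a" is ignored
    make_set(element)
```

Each call does a constant amount of work on average, so building a forest of n singletons this way takes time proportional to n.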
The Find operation follows the chain of parent pointers from a specified query node x until it reaches a root element. This root element represents the set to which x belongs and may be x itself. Find returns the root element it reaches.
Performing a Find operation presents an important opportunity for improving the forest. The time in a Find operation is spent chasing parent pointers, so a flatter tree leads to faster Find operations. When a Find is executed, there is no faster way to reach the root than by following each parent pointer in succession. However, the parent pointers visited during this search can be updated to point closer to the root. Because every element visited on the way to a root is part of the same set, this does not change the sets stored in the forest. But it makes future Find operations faster, not only for the nodes between the query node and the root, but also for their descendants. This updating is an important part of the disjoint-set forest's amortized performance guarantee.
There are several algorithms for Find that achieve the asymptotically optimal time complexity. One family of algorithms, known as path compression, makes every node between the query node and the root point to the root. Path compression can be implemented using a simple recursion as follows:
function Find(x) is
    if x.parent ≠ x then
        x.parent := Find(x.parent)
        return x.parent
    else
        return x
    end if
end function
This implementation makes two passes, one up the tree and one back down. It requires enough scratch memory to store the path from the query node to the root (in the above pseudocode, the path is implicitly represented using the call stack). This can be decreased to a constant amount of memory by performing both passes in the same direction. The constant memory implementation walks from the query node to the root twice, once to find the root and once to update pointers:
function Find(x) is
    root := x
    while root.parent ≠ root do
        root := root.parent
    end while
    while x.parent ≠ root do
        parent := x.parent
        x.parent := root
        x := parent
    end while
    return root
end function
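The constant-memory variant of path compression can be sketched in Python as follows, assuming an array-based forest in which `parent[i] == i` marks a root. The function name `find` and the list `parent` are illustrative assumptions.

```python
def find(parent, x):
    """Return the root of x and point every node on the path directly at it."""
    # First pass: walk up to locate the root.
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second pass: walk up again, redirecting each visited node to the root.
    while parent[x] != root:
        next_node = parent[x]
        parent[x] = root
        x = next_node
    return root


# A chain 4 -> 3 -> 2 -> 1 -> 0, where 0 is the root.
chain = [0, 0, 1, 2, 3]
representative = find(chain, 4)
```

After the call, every node of the chain points directly at node 0, so a later `find` on any of them needs at most one step.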
Tarjan and Van Leeuwen also developed one-pass Find algorithms that retain the same worst-case complexity but are more efficient in practice.[4] These are called path splitting and path halving. Both of these update the parent pointers of nodes on the path between the query node and the root. Path splitting replaces every parent pointer on that path by a pointer to the node's grandparent:
function Find(x) is
    while x.parent ≠ x do
        (x, x.parent) := (x.parent, x.parent.parent)
    end while
    return x
end function
Path halving works similarly but replaces only every other parent pointer:
function Find(x) is
    while x.parent ≠ x do
        x.parent := x.parent.parent
        x := x.parent
    end while
    return x
end function
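To make the difference between the two one-pass strategies concrete, the following Python sketch applies each of them to the same chain. It assumes an array-based forest with self-referencing roots; the names `find_splitting` and `find_halving` are illustrative.

```python
def find_splitting(parent, x):
    """Path splitting: every node on the path is re-pointed to its grandparent."""
    while parent[x] != x:
        next_node = parent[x]
        parent[x] = parent[next_node]   # skip one level
        x = next_node                   # then step to the old parent
    return x


def find_halving(parent, x):
    """Path halving: every other node on the path is re-pointed to its grandparent."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # skip one level
        x = parent[x]                   # then step to the new parent
    return x


# Chain 5 -> 4 -> 3 -> 2 -> 1 -> 0, once for each strategy.
split_forest = [0, 0, 1, 2, 3, 4]
halved_forest = [0, 0, 1, 2, 3, 4]
split_root = find_splitting(split_forest, 5)
halved_root = find_halving(halved_forest, 5)
```

On this chain, splitting leaves the parents as `[0, 0, 0, 1, 2, 3]`, while halving leaves `[0, 0, 1, 1, 3, 3]`: both roughly halve the depth of the queried node, but halving writes only to every other node.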
[Figure: MakeSet creates 8 singletons; after Union operations, some sets are grouped together.]
The operation Union(x, y) replaces the set containing x and the set containing y with their union. Union first uses Find to determine the roots of the trees containing x and y. If the roots are the same, there is nothing more to do. Otherwise, the two trees must be merged. This is done by either setting the parent pointer of x's root to y's, or setting the parent pointer of y's root to x's.
The choice of which node becomes the parent has consequences for the complexity of future operations on the tree. If it is done carelessly, trees can become excessively tall. For example, suppose that Union always made the tree containing x a subtree of the tree containing y. Begin with a forest that has just been initialized with elements 1, 2, 3, ..., n, and execute Union(1, 2), Union(2, 3), ..., Union(n − 1, n). The resulting forest contains a single tree whose root is n, and the path from 1 to n passes through every node in the tree. For this forest, the time to run Find(1) is O(n).
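The degenerate case described above can be reproduced with a short Python sketch. It assumes an array-based forest without any path updates; `root`, `naive_union`, and `steps_to_root` are hypothetical helper names, and n = 1000 is an arbitrary choice.

```python
def root(parent, x):
    """Follow parent indices to the root without updating any pointers."""
    while parent[x] != x:
        x = parent[x]
    return x


def naive_union(parent, x, y):
    """Careless merge: always hang the tree containing x below the tree containing y."""
    rx, ry = root(parent, x), root(parent, y)
    if rx != ry:
        parent[rx] = ry


def steps_to_root(parent, x):
    """Count the parent pointers followed from x to its root."""
    steps = 0
    while parent[x] != x:
        x = parent[x]
        steps += 1
    return steps


n = 1000
parent = list(range(n + 1))        # index 0 is unused so that elements are 1..n
for i in range(1, n):
    naive_union(parent, i, i + 1)  # Union(1, 2), Union(2, 3), ..., Union(n - 1, n)

chain_length = steps_to_root(parent, 1)
```

Here `chain_length` equals n − 1, so a single lookup from element 1 touches every node in the tree.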
In an efficient implementation, tree height is controlled using union by size or union by rank. Both of these require a node to store information besides just its parent pointer. This information is used to decide which root becomes the new parent. Both strategies ensure that trees do not become too deep.
In the case of union by size, a node stores its size, which is simply its number of descendants (including the node itself). When the trees with roots x and y are merged, the node with more descendants becomes the parent. If the two nodes have the same number of descendants, then either one can become the parent. In both cases, the size of the new parent node is set to its new total number of descendants.
function Union(x, y) is
    // Replace nodes by roots
    x := Find(x)
    y := Find(y)
    if x = y then
        return  // x and y are already in the same set
    end if
    // If necessary, swap variables to ensure that
    // x has at least as many descendants as y
    if x.size < y.size then
        (x, y) := (y, x)
    end if
    // Make x the new root
    y.parent := x
    // Update the size of x
    x.size := x.size + y.size
end function
The number of bits necessary to store the size is clearly the number of bits necessary to store n. This adds a constant factor to the forest's required storage.
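Union by size can be sketched in Python as below. This is one possible arrangement under stated assumptions: the class name `DisjointSet` is illustrative, elements are the integers 0 to n − 1, `find` uses path halving, and the boolean return value of `union` is an added convenience that the pseudocode does not have.

```python
class DisjointSet:
    """Array-based disjoint-set forest using union by size (illustrative sketch)."""

    def __init__(self, n):
        self.parent = list(range(n))   # every element starts as its own root
        self.size = [1] * n            # subtree sizes; only valid at roots

    def find(self, x):
        # Path halving keeps the trees shallow as a side effect of lookups.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        """Merge the sets of x and y; return False if they were already merged."""
        x, y = self.find(x), self.find(y)
        if x == y:
            return False
        if self.size[x] < self.size[y]:
            x, y = y, x                # ensure x is the root of the larger tree
        self.parent[y] = x
        self.size[x] += self.size[y]
        return True


ds = DisjointSet(6)
ds.union(0, 1)
ds.union(2, 3)
ds.union(0, 2)
```

After these three calls, elements 0 to 3 share one representative whose stored size is 4, while elements 4 and 5 remain singletons.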
For union by rank, a node stores its rank, which is an upper bound for its height. When a node is initialized, its rank is set to zero. To merge trees with roots x and y, first compare their ranks. If the ranks are different, then the larger rank tree becomes the parent, and the ranks of x and y do not change. If the ranks are the same, then either one can become the parent, but the new parent's rank is incremented by one. While the rank of a node is clearly related to its height, storing ranks is more efficient than storing heights. The height of a node can change during a Find operation, so storing ranks avoids the extra effort of keeping the height correct. In pseudocode, union by rank is:
function Union(x, y) is
    // Replace nodes by roots
    x := Find(x)
    y := Find(y)
    if x = y then
        return  // x and y are already in the same set
    end if
    // If necessary, rename variables to ensure that
    // x has rank at least as large as that of y
    if x.rank < y.rank then
        (x, y) := (y, x)
    end if
    // Make x the new root
    y.parent := x
    // If necessary, increment the rank of x
    if x.rank = y.rank then
        x.rank := x.rank + 1
    end if
end function
It can be shown that every node has rank $\lfloor \log n \rfloor$ or less.[11] Consequently each rank can be stored in O(log log n) bits and all the ranks can be stored in O(n log log n) bits. This makes the ranks an asymptotically negligible portion of the forest's size.
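The following Python sketch of union by rank also checks the logarithmic rank bound on a small case. It assumes an array-based forest and base-2 logarithms; the names `find`, `union`, `parent`, and `rank` are illustrative. The merge order is a tournament, chosen because it always joins trees of equal rank and therefore pushes ranks as high as they can go.

```python
import math


def find(parent, x):
    """Locate the root of x, then compress the path onto it."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def union(parent, rank, x, y):
    """Union by rank: the root of higher rank becomes the parent."""
    x, y = find(parent, x), find(parent, y)
    if x == y:
        return
    if rank[x] < rank[y]:
        x, y = y, x
    parent[y] = x
    if rank[x] == rank[y]:
        rank[x] += 1   # equal ranks: the surviving root grows by one


n = 16
parent = list(range(n))
rank = [0] * n

# Tournament merge: pairs, then pairs of pairs, and so on.
step = 1
while step < n:
    for i in range(0, n, 2 * step):
        union(parent, rank, i, i + step)
    step *= 2

highest_rank = max(rank)
```

For n = 16 the highest rank reached is 4, which matches the floor of the base-2 logarithm of n.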
It is clear from the above implementations that the size and rank of a node do not matter unless a node is the root of a tree. Once a node becomes a child, its size and rank are never accessed again.
There is a variant of the Union operation in which the user determines the representative of the formed set. It is not hard to add this functionality to the above algorithms without losing efficiency.
A disjoint-set forest implementation in which Find does not update parent pointers, and in which Union does not attempt to control tree heights, can have trees with height O(n). In such a situation, the Find and Union operations require O(n) time.
If an implementation uses path compression alone, then a sequence of n MakeSet operations, followed by up to n − 1 Union operations and f Find operations, has a worst-case running time of $\Theta(n + f \cdot (1 + \log_{2+f/n} n))$.[11]
Using union by rank, but without updating parent pointers during Find, gives a running time of $\Theta(m \log n)$ for m operations of any type, up to n of which are MakeSet operations.[11]
The combination of path compression, splitting, or halving, with union by size or by rank, reduces the running time for m operations of any type, up to n of which are MakeSet operations, to $\Theta(m\alpha(n))$.[4][5] This makes the amortized running time of each operation $\Theta(\alpha(n))$. This is asymptotically optimal, meaning that every disjoint-set data structure must use $\Omega(\alpha(n))$ amortized time per operation.[6] Here, the function $\alpha(n)$ is the inverse Ackermann function. The inverse Ackermann function grows extraordinarily slowly, so this factor is 4 or less for any n that can actually be written in the physical universe. This makes disjoint-set operations practically amortized constant time.
The precise analysis of the performance of a disjoint-set forest is somewhat intricate. However, there is a much simpler analysis that proves that the amortized time for any m Find or Union operations on a disjoint-set forest containing n objects is O(m log* n), where log* denotes the iterated logarithm.[12][13][14][15]
Lemma 1: As the find function follows the path up to the root, the ranks of the nodes it encounters are increasing.
We claim that as Find and Union operations are applied to the data set, this fact remains true over time. Initially, when each node is the root of its own tree, it is trivially true. The only case when the rank of a node might be changed is when the Union by Rank operation is applied. In this case, a tree with smaller rank will be attached to a tree with greater rank, rather than vice versa. And during the find operation, all nodes visited along the path will be attached to the root, which has larger rank than its children, so this operation won't change this fact either.
Lemma 2: A node u which is root of a subtree with rank r has at least $2^r$ nodes.
Initially, when each node is the root of its own tree, it is trivially true. Assume that a node u with rank r has at least $2^r$ nodes. Then when two trees with rank r are merged using the operation Union by Rank, a tree with rank r + 1 results, the root of which has at least $2^r + 2^r = 2^{r+1}$ nodes.
Lemma 3: The maximum number of nodes of rank r is at most $n/2^r$.
From Lemma 2, we know that a node u which is root of a subtree with rank r has at least $2^r$ nodes. We will get the maximum number of nodes of rank r when each node with rank r is the root of a tree that has exactly $2^r$ nodes. In this case, the number of nodes of rank r is $n/2^r$.
At any particular point in the execution, we can group the vertices of the graph into "buckets", according to their rank. We define the buckets' ranges inductively, as follows: Bucket 0 contains vertices of rank 0. Bucket 1 contains vertices of rank 1. Bucket 2 contains vertices of ranks 2 and 3. In general, if the B-th bucket contains vertices with ranks from interval $[r, 2^r - 1] = [r, R - 1]$, then the (B+1)st bucket will contain vertices with ranks from interval $[R, 2^R - 1]$.
For $B \in \mathbb{N}$, let $\text{tower}(B) = \underbrace{2^{2^{\cdots^{2}}}}_{B \text{ times}}$. Then bucket $B$ will have vertices with ranks in the interval $[\text{tower}(B-1), \text{tower}(B) - 1]$.
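The bucket boundaries can be checked numerically with a small Python sketch. It assumes the convention that each bucket after the first starts where the previous one ended and runs up to a power tower of twos; `tower`, `bucket_of`, and `log_star` are hypothetical helper names, with `tower(b)` denoting a stack of b twos and `tower(0) = 1`.

```python
def tower(b):
    """Power tower of b twos: tower(0) = 1, tower(1) = 2, tower(2) = 4, tower(3) = 16."""
    result = 1
    for _ in range(b):
        result = 2 ** result
    return result


def bucket_of(rank):
    """Index of the bucket holding a given rank (bucket 0 holds only rank 0)."""
    if rank == 0:
        return 0
    b = 1
    while rank >= tower(b):   # bucket b covers ranks tower(b - 1) .. tower(b) - 1
        b += 1
    return b


def log_star(n):
    """Iterated logarithm for n >= 1: the smallest b with tower(b) >= n."""
    b = 0
    while tower(b) < n:
        b += 1
    return b


buckets = [bucket_of(r) for r in (0, 1, 2, 3, 4, 15, 16)]
```

The resulting list is `[0, 1, 2, 2, 3, 3, 4]`, matching the ranges 0, 1, 2 to 3, 4 to 15, and 16 to 65535. Because `tower` grows so quickly, `log_star(65536)` is only 4.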
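We can make two observations about the buckets' sizes.
1. The total number of buckets is at most $\log^{*} n$. Proof: Since no vertex can have rank greater than $n$, only the first $\log^{*} n$ buckets can have vertices, where $\log^{*}$ denotes the inverse of the tower function.
2. The maximum number of elements in bucket $[B, 2^B - 1]$ is at most $2n/2^B$. Proof: By Lemma 3, the maximum number of elements in bucket $[B, 2^B - 1]$ is at most $\frac{n}{2^B} + \frac{n}{2^{B+1}} + \frac{n}{2^{B+2}} + \cdots + \frac{n}{2^{2^B - 1}} \le \frac{2n}{2^B}$.
Let F represent the list of "find" operations performed, and let
$T_1 = \sum_F (\text{link to the root})$
$T_2 = \sum_F (\text{number of links traversed where the buckets are different})$
$T_3 = \sum_F (\text{number of links traversed where the buckets are the same}).$
Then the total cost of m finds is $T = T_1 + T_2 + T_3$.
Since each find operation makes exactly one traversal that leads to a root, we have $T_1 = O(m)$.
Also, from the bound above on the number of buckets, we have $T_2 = O(m \log^{*} n)$.
For $T_3$, suppose we are traversing an edge from u to v, where u and v have rank in the bucket $[B, 2^B - 1]$ and v is not the root (at the time of this traversal; otherwise the traversal would be accounted for in $T_1$). Fix u and consider the sequence $v_1, v_2, \ldots, v_k$ of nodes that take the role of v in different find operations. Because of path compression, and because edges to a root are not counted, this sequence contains only distinct nodes, and because of Lemma 1 we know that the ranks of the nodes in this sequence are strictly increasing. Since all of these nodes are in the same bucket, we can conclude that the length k of the sequence (the number of times node u is attached to a different root in the same bucket) is at most the number of ranks in the bucket B, that is, at most $2^B - 1 - B < 2^B$.
Therefore, $T_3 \le \sum_{[B, 2^B - 1]} \sum_u 2^B$.
From Observations 1 and 2, we can conclude that $T_3 \le \sum_B 2^B \cdot \frac{2n}{2^B} \le 2n \log^{*} n$.
Therefore, $T = T_1 + T_2 + T_3 = O(m \log^{*} n)$.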
The worst-case time of the Find operation in trees with Union by rank or Union by weight is $\Theta(\log n)$ (i.e., it is $O(\log n)$ and this bound is tight). In 1985, N. Blum gave an implementation of the operations that does not use path compression, but compresses trees during Union. His implementation runs in $O(\log n / \log\log n)$ time per operation,[16] and thus in comparison with Galler and Fischer's structure it has a better worst-case time per operation, but inferior amortized time. In 1999, Alstrup et al. gave a structure that has optimal worst-case time $O(\log n / \log\log n)$ together with inverse-Ackermann amortized time.[17]
The regular implementation as disjoint-set forests does not react favorably to the deletion of elements, in the sense that the time for Find will not improve as a result of the decrease in the number of elements. However, there exist modern implementations that allow for constant-time deletion and where the time bound for Find depends on the current number of elements.[18][19]
It is possible to extend certain disjoint-set forest structures to allow backtracking. The basic form of backtracking is to allow a Backtrack(1) operation, which undoes the last Union. A more advanced form allows Backtrack(i), which undoes the last i unions. The following complexity result is known: there is a data structure which supports Union and Find in $O(\log n / \log\log n)$ time per operation, and Backtrack in $O(1)$ time.[20] In this result, the freedom of Union to choose the representative of the formed set is essential. Better amortized time cannot be achieved within the class of separable pointer algorithms.[20]
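To illustrate what a Backtrack operation does, the Python sketch below uses a simpler and slower scheme than the structure cited above: union by size with no path compression, plus a history stack of merges. It is not the data structure from [20], and its Find takes logarithmic time. The class name `RollbackDisjointSet` and its method names are illustrative assumptions.

```python
class RollbackDisjointSet:
    """Union by size without path compression, so each union can be undone."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.history = []              # one record per union call, newest last

    def find(self, x):
        # No pointer updates here: compression would make unions hard to undo.
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        x, y = self.find(x), self.find(y)
        if x == y:
            self.history.append(None)  # record a no-op so the counts stay aligned
            return
        if self.size[x] < self.size[y]:
            x, y = y, x
        self.parent[y] = x
        self.size[x] += self.size[y]
        self.history.append((x, y))

    def backtrack(self, i=1):
        """Undo the last i union calls (fewer if the history is shorter)."""
        for _ in range(min(i, len(self.history))):
            record = self.history.pop()
            if record is not None:
                x, y = record
                self.parent[y] = y     # detach y from x again
                self.size[x] -= self.size[y]


ds = RollbackDisjointSet(4)
ds.union(0, 1)
ds.union(2, 3)
ds.union(0, 2)
ds.backtrack(1)                        # undoes only the merge of {0, 1} with {2, 3}
```

After the final call, elements 0 and 1 are still together and so are 2 and 3, but the two pairs are separate again.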
Disjoint-set data structures model the partitioning of a set, for example to keep track of the connected components of an undirected graph. This model can then be used to determine whether two vertices belong to the same component, or whether adding an edge between them would result in a cycle. The Union–Find algorithm is used in high-performance implementations of unification.[21]
This data structure is used by the Boost Graph Library to implement its Incremental Connected Components functionality. It is also a key component in implementing Kruskal's algorithm to find the minimum spanning tree of a graph.
The Hoshen-Kopelman algorithm uses a Union-Find in the algorithm.
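As a sketch of how the cycle test drives Kruskal's algorithm, the Python function below scans edges in order of increasing weight and keeps an edge only when its endpoints lie in different sets. The function name `kruskal`, the `(weight, u, v)` edge format, and the vertex numbering 0 to n − 1 are assumptions made for this example.

```python
def kruskal(n, edges):
    """Return the edges of a minimum spanning forest of an undirected graph.

    n is the number of vertices; edges is a list of (weight, u, v) tuples.
    """
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    chosen = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue                        # same component: this edge would close a cycle
        if size[root_u] < size[root_v]:
            root_u, root_v = root_v, root_u
        parent[root_v] = root_u             # union by size
        size[root_u] += size[root_v]
        chosen.append((weight, u, v))
    return chosen


example_edges = [(3, 0, 2), (1, 0, 1), (4, 2, 3), (2, 1, 2)]
tree = kruskal(4, example_edges)
```

On this four-vertex example the edge of weight 3 is rejected, because vertices 0 and 2 are already connected through vertex 1, and the remaining three edges form the spanning tree.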
Theorem 5: Any CPROBE(log n) implementation of the set union problem requires Ω(m α(m, n)) time to execute m Finds and n − 1 Unions, beginning with n singleton sets.