Trees I: Radix trees

[Posted March 13, 2006 by corbet]

The kernel includes a number of library routines for the implementation ofuseful data structures. Among those are two types of trees: radix treesand red-black trees. This article will have a look at the radix tree API,with red-black trees to follow in the future.

Wikipedia hasa radixtree article, but Linux radix trees are not well described by thatarticle. A Linux radix tree is a mechanism by which a (pointer) value canbe associated with a (long) integer key. It is reasonably efficient in terms ofstorage, and is quite quick on lookups. Additionally, radix trees in theLinux kernel have some features driven by kernel-specific needs, includingthe ability to associate tags with specific entries.

[radix tree node] The cheesy diagram on the right shows a leaf node from a Linux radix tree.The node contains a number of slots, each of which can contain a pointer tosomething of interest to the creator of the tree. Empty slots contain aNULL pointer. These trees are quite broad - in the 2.6.16-rckernels, there are 64 slots in each radix tree node. Slots are indexed bya portion of the (long) integer key. If the highest key value is less than64, the entire tree can be represented with a single node.Normally, however, a rather larger set of keys is in use - otherwise, asimple array could have been used. So a larger tree might look somethinglike this:

This tree is three levels deep. When the kernel goes to look up a specifickey, the most significant six bits will be used to find the appropriateslot in the root node. The next six bits then index the slot in the middlenode, and the least significant six bits will indicate the slot containing apointer to the actual value. Nodes which have no children are not presentin the tree, so a radix tree can provide efficient storage for sparsetrees.

Radix trees have a few users in the mainline kernel tree. The PowerPCarchitecture uses a tree to map between real and virtual IRQ numbers. TheNFS code attaches a tree to relevantinode structures to keeptrack of outstanding requests. The most widespread use of radix trees,however, is in the memory management code. Theaddress_spacestructure used to keep track of backing store contains a radix tree whichtracks in-core pages tied to that mapping. Among other things, this treeallows the memory management code to quickly find pages which are dirty orunder writeback.

As is typical with kernel data structures, there are two modes fordeclaring and initializing radix trees:

    #include <linux/radix-tree.h>    RADIX_TREE(name, gfp_mask);  /* Declare and initialize */    struct radix_tree_root my_tree;    INIT_RADIX_TREE(my_tree, gfp_mask);

The first form declares and initializes a radix tree with the givenname; the second form performs the initialization at run time. Ineither case, agfp_mask must be provided to tell the code howmemory allocations are to be performed. If radix tree operations(insertions, in particular) are to be performed in atomic context, thegiven mask should beGFP_ATOMIC.

The functions for adding and removing entries are straightforward:

    int radix_tree_insert(struct radix_tree_root *tree, unsigned long key,                           void *item);    void *radix_tree_delete(struct radix_tree_root *tree, unsigned long key);

A call toradix_tree_insert() will cause the givenitemto be inserted (associated withkey) in the giventree. Thisoperation may require memory allocations; should an allocation fail, theinsertion will fail and the return value will be-ENOMEM. Thecode will refuse to overwrite an existing entry; ifkey alreadyexists in the tree,radix_tree_insert() will return-EEXIST. On success, the return value is zero.radix_tree_delete() removes the item associated withkeyfromtree, returning a pointer to that item if it was present.

There are situations where failure to insert an item into a radix tree canbe a significant problem. To help avoid such situations, a pair of specializedfunctions are provided:

    int radix_tree_preload(gfp_t gfp_mask);    void radix_tree_preload_end(void);

This function will attempt to allocate sufficient memory (using the givengfp_mask) to guarantee that the next radix tree insertion cannotfail. The allocated structures are stored in a per-CPU variable, meaningthat the calling function must perform the insertion before it can scheduleor be moved to a different processor. To that end,radix_tree_preload() will, when successful, return with preemptiondisabled; the caller must eventually ensure that preemption is enabledagain by callingradix_tree_preload_end(). On failure,-ENOMEM is returned and preemption isnot disabled.

Radix tree lookups can be done in a few ways:

    void *radix_tree_lookup(struct radix_tree_root *tree, unsigned long key);    void **radix_tree_lookup_slot(struct radix_tree_root *tree, unsigned long key);    unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,                                         void **results,unsigned long first_index, unsigned int max_items);

The simplest form,radix_tree_lookup(), looks forkey inthetree and returns the associated item (orNULL onfailure).radix_tree_lookup_slot() will, instead, return apointer to the slot holding the pointer to the item. The caller can, then,change the pointer to associate a new item with thekey. If theitem does not exist, however,radix_tree_lookup_slot() will notcreate a slot for it, so this function cannot be used in place ofradix_tree_insert().

Finally, a call toradix_tree_gang_lookup() will return up tomax_items items inresults, with ascending key valuesstarting atfirst_index. The number of items returned may be lessthan requested, but a short return (other than zero) does not imply thatthere are no more values in the tree.

One should note that none of the radix treefunctions perform any sort of locking internally. It is up to the callerto ensure that multiple threads do not corrupt the tree or get into othersorts of unpleasant race conditions. Nick Piggin currently has a patchcirculating which would use read-copy-update to free tree nodes; this patchwould allow lookup operations to be performed without locking as long as(1) the resulting pointer is only used in atomic context, and(2) the calling code avoids creating race conditions of its own. Itis not clear when that patch might be merged, however.

The radix tree code supports a feature called "tags," wherein specific bitscan be set on items in the tree. Tags are used, for example, to markmemory pages which are dirty or under writeback. The API for working withtags is:

    void *radix_tree_tag_set(struct radix_tree_root *tree,unsigned long key, int tag);    void *radix_tree_tag_clear(struct radix_tree_root *tree,unsigned long key, int tag);    int radix_tree_tag_get(struct radix_tree_root *tree,unsigned long key, int tag);

radix_tree_tag_set() will set the giventag on the itemindexed bykey; it is an error to attempt to set a tag on anonexistent key. The return value will be a pointer to the tagged item.Whiletag looks like an arbitrary integer, thecode as currently written allows for a maximum of two tags. Use of any tagvalue other than zero or one will silently corrupt memory in someundesirable place; consider yourself warned.

Tags can be removed withradix_tree_tag_clear(); once again, thereturn value is a pointer to the (un)tagged item. The functionradix_tree_tag_get() will check whether the item indexed bykey has the giventag set; the return value is zero ifkey is not present, -1 ifkey is present buttagis not set, and +1 otherwise. This function is currently commented out inthe source, however, since no in-tree code uses it.

There are two other functions for querying tags:

    int radix_tree_tagged(struct radix_tree_root *tree, int tag);    unsigned int radix_tree_gang_lookup_tag(struct radix_tree_root *tree,                                             void **results,    unsigned long first_index,     unsigned int max_items,     int tag);

radix_tree_tagged() returns a non-zero value if any item in thetree bears the giventag. A list of items with a given tag can beobtained withradix_tree_gang_lookup_tag().

In concluding, we can note one other interesting aspect of the radix treeAPI: there is no function for destroying a radix tree. It is, evidently,assumed that radix trees will last forever. In practice, deleting allitems from a radix tree will free all memory associated with it other thanthe root node, which can then be disposed of normally.

Index entries for this article
Kernel	Radix tree

Node size and cache-line ping pong on SMP

Posted Mar 17, 2006 21:27 UTC (Fri) byjzbiciak (guest, #5246) [Link]

I wonder what's magic about 64, other than that it's 6 bits? On a 32-bit architecture, that's 256 bytes of storage for the 64 pointers. On a machine like the Athlon, cache lines are 64 bytes, which means each node of the tree occupies 4 cache lines. In an SMP context, is there an advantage to sizing the nodes down to occupy a single cache line, or is there a sort of hashing benefit to having each node require multiple cache lines? e.g. Would I get more pingponging or less if the nodes were smaller? Does the answer change based on # of CPUs?

Trees I: Radix trees

Posted Mar 19, 2006 8:09 UTC (Sun) byncm (guest, #165) [Link] (4 responses)

The designer of Judy trees has expressed wonder at why anybody ever talks about binary trees (e.g. red-black trees) any more, since his measurements indicate that no matter what you do, their performance always stinks compared to modern cache-aware techniques. The only reasonable explanation is the same as the reason university graduates, once, all knew ancient Greek, and had studied geometry but not calculus.

Trees I: Radix trees

Posted Jun 30, 2006 23:22 UTC (Fri) bywahern (subscriber, #37304) [Link]

The author/inventor put it well himself when he said, "Well I cannot describe Judy in 10 minutes -- what possessed me?"

Source:http://judy.sourceforge.net/doc/10minutes.htm

Simplicity is often a very valuable quality, especially in software development.

Judy licensing

Posted Jul 18, 2007 7:59 UTC (Wed) byiler (guest, #46313) [Link] (1 responses)

Another reason, besides simlpicity, is licensing.

Is Judy GPLed ?

Judy licensing

Posted Oct 23, 2008 21:31 UTC (Thu) bybcl (subscriber, #17631) [Link]

LGPL according to the article linked above

Trees I: Radix trees

Posted Jul 22, 2013 10:25 UTC (Mon) bypuchuu (guest, #92032) [Link]

Judy author sold himself and his ideas to HP. Why linux should collect HP patents? (see USA patent 6735595)

Also the fastest hash table in the world is "burst trie", not judy arrays. Linux can use fork of "burst trie" idea named "hat trie" in future. It is more memory efficient. These ideas are free, no patents.

Trees I: Radix trees

Posted May 28, 2015 18:44 UTC (Thu) byaaabr (guest, #102856) [Link]

How well would this apply to large addresses (i.e IPv6)? It has 64 slots, so will it be able to hold addresses much larger than 64 bits? Thank you

Trees I: Radix trees

Posted Sep 25, 2019 19:37 UTC (Wed) byNaarayanan (guest, #134635) [Link]

How do you iterate/traverse through a radix tree and print out its contents to the console. Can we use gang_lookup() or is there any other API for that?

Movatterモバイル変換

Trees I: Radix trees

Node size and cache-line ping pong on SMP

Trees I: Radix trees

Trees I: Radix trees

Judy licensing

Judy licensing

Trees I: Radix trees

Trees I: Radix trees

Trees I: Radix trees