
What is RCU, Fundamentally?


December 17, 2007

This article was contributed by Paul McKenney

[Editor's note: this is the first in a three-part series on how the read-copy-update mechanism works. Many thanks to Paul McKenney and Jonathan Walpole for allowing us to publish these articles. The remaining two sections will appear in future weeks.]

Part 1 of 3 of What is RCU, Really?

Paul E. McKenney, IBM Linux Technology Center
Jonathan Walpole, Portland State University Department of Computer Science

Introduction

Read-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October of 2002. RCU achieves scalability improvements by allowing reads to occur concurrently with updates. In contrast with conventional locking primitives that ensure mutual exclusion among concurrent threads regardless of whether they be readers or updaters, or with reader-writer locks that allow concurrent reads but not in the presence of updates, RCU supports concurrency between a single updater and multiple readers. RCU ensures that reads are coherent by maintaining multiple versions of objects and ensuring that they are not freed up until all pre-existing read-side critical sections complete. RCU defines and uses efficient and scalable mechanisms for publishing and reading new versions of an object, and also for deferring the collection of old versions. These mechanisms distribute the work among read and update paths in such a way as to make read paths extremely fast. In some cases (non-preemptable kernels), RCU's read-side primitives have zero overhead.

Quick Quiz 1: But doesn't seqlock also permit readers and updaters to get work done concurrently?

This leads to the question "what exactly is RCU?", and perhaps also to the question "how can RCU possibly work?" (or, not infrequently, the assertion that RCU cannot possibly work). This document addresses these questions from a fundamental viewpoint; later installments look at them from usage and from API viewpoints. This last installment also includes a list of references.

RCU is made up of three fundamental mechanisms, the first being used for insertion, the second being used for deletion, and the third being used to allow readers to tolerate concurrent insertions and deletions. These mechanisms are described in the following sections, which focus on applying RCU to linked lists:

  1. Publish-Subscribe Mechanism (for insertion)
  2. Wait For Pre-Existing RCU Readers to Complete (for deletion)
  3. Maintain Multiple Versions of Recently Updated Objects (for readers)

These sections are followed by concluding remarks and the answers to the Quick Quizzes.

Publish-Subscribe Mechanism

One key attribute of RCU is the ability to safely scan data, even though that data is being modified concurrently. To provide this ability for concurrent insertion, RCU uses what can be thought of as a publish-subscribe mechanism. For example, consider an initially NULL global pointer gp that is to be modified to point to a newly allocated and initialized data structure. The following code fragment (with the addition of appropriate locking) might be used for this purpose:

  1 struct foo {
  2   int a;
  3   int b;
  4   int c;
  5 };
  6 struct foo *gp = NULL;
  7
  8 /* . . . */
  9
 10 p = kmalloc(sizeof(*p), GFP_KERNEL);
 11 p->a = 1;
 12 p->b = 2;
 13 p->c = 3;
 14 gp = p;

Unfortunately, there is nothing forcing the compiler and CPU to execute the last four assignment statements in order. If the assignment to gp happens before the initialization of p's fields, then concurrent readers could see the uninitialized values. Memory barriers are required to keep things ordered, but memory barriers are notoriously difficult to use. We therefore encapsulate them into a primitive rcu_assign_pointer() that has publication semantics. The last four lines would then be as follows:

  p->a = 1;
  p->b = 2;
  p->c = 3;
  rcu_assign_pointer(gp, p);

The rcu_assign_pointer() would publish the new structure, forcing both the compiler and the CPU to execute the assignment to gp after the assignments to the fields referenced by p.
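
Conceptually, rcu_assign_pointer() amounts to a write memory barrier followed by the pointer assignment. The following is a minimal sketch along the lines of the 2007-era Linux implementation, not a definitive statement of the current kernel macro:

  #define rcu_assign_pointer(p, v) \
          ({ \
                  smp_wmb(); /* order prior initialization before publication */ \
                  (p) = (v); \
          })

Because the barrier is folded into the primitive, updaters get the required ordering without having to reason about smp_wmb() placement at each call site.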

However, it is not sufficient to only enforce ordering at the updater, as the reader must enforce proper ordering as well. Consider for example the following code fragment:

  p = gp;
  if (p != NULL) {
    do_something_with(p->a, p->b, p->c);
  }

Although this code fragment might well seem immune to misordering, unfortunately, the DEC Alpha CPU [PDF] and value-speculation compiler optimizations can, believe it or not, cause the values of p->a, p->b, and p->c to be fetched before the value of p! This is perhaps easiest to see in the case of value-speculation compiler optimizations, where the compiler guesses the value of p, fetches p->a, p->b, and p->c, then fetches the actual value of p in order to check whether its guess was correct. This sort of optimization is quite aggressive, perhaps insanely so, but does actually occur in the context of profile-driven optimization.

Clearly, we need to prevent this sort of skullduggery on the part of both the compiler and the CPU. The rcu_dereference() primitive uses whatever memory-barrier instructions and compiler directives are required for this purpose:

  rcu_read_lock();
  p = rcu_dereference(gp);
  if (p != NULL) {
    do_something_with(p->a, p->b, p->c);
  }
  rcu_read_unlock();

The rcu_dereference() primitive can thus be thought of as subscribing to a given value of the specified pointer, guaranteeing that subsequent dereference operations will see any initialization that occurred before the corresponding publish (rcu_assign_pointer()) operation. The rcu_read_lock() and rcu_read_unlock() calls are absolutely required: they define the extent of the RCU read-side critical section. Their purpose is explained in the next section; however, they never spin or block, nor do they prevent the list_add_rcu() from executing concurrently. In fact, in non-CONFIG_PREEMPT kernels, they generate absolutely no code.
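
Conceptually, rcu_dereference() is an ordinary pointer fetch followed by a data-dependency barrier, which is a no-op on every architecture except DEC Alpha. A minimal sketch, roughly along the lines of the classic implementation (not the exact kernel macro):

  #define rcu_dereference(p) \
          ({ \
                  typeof(p) _p1 = (p);         /* fetch the pointer once */ \
                  smp_read_barrier_depends();  /* no-op except on DEC Alpha */ \
                  (_p1); \
          })

The cheapness of this subscribe operation is what keeps RCU read paths fast: on most systems it compiles down to a plain load.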

Although rcu_assign_pointer() and rcu_dereference() can in theory be used to construct any conceivable RCU-protected data structure, in practice it is often better to use higher-level constructs. Therefore, the rcu_assign_pointer() and rcu_dereference() primitives have been embedded in special RCU variants of Linux's list-manipulation API. Linux has two variants of doubly linked list, the circular struct list_head and the linear struct hlist_head/struct hlist_node pair. The former is laid out as follows, where the green boxes represent the list header and the blue boxes represent the elements in the list.

Linux list

Adapting the pointer-publish example for the linked list gives the following:

  1 struct foo {
  2   struct list_head list;
  3   int a;
  4   int b;
  5   int c;
  6 };
  7 LIST_HEAD(head);
  8
  9 /* . . . */
 10
 11 p = kmalloc(sizeof(*p), GFP_KERNEL);
 12 p->a = 1;
 13 p->b = 2;
 14 p->c = 3;
 15 list_add_rcu(&p->list, &head);

Line 15 must be protected by some synchronization mechanism (most commonly some sort of lock) to prevent multiple list_add() instances from executing concurrently. However, such synchronization does not prevent this list_add() from executing concurrently with RCU readers.
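
For concreteness, the update-side exclusion might look like the following sketch, which assumes a hypothetical spinlock named mylock dedicated to updates of this list; readers never acquire it:

  spin_lock(&mylock);
  list_add_rcu(&p->list, &head);
  spin_unlock(&mylock);

Any mutual-exclusion scheme that serializes updaters would serve equally well here.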

Subscribing to an RCU-protected list is straightforward:

  rcu_read_lock();
  list_for_each_entry_rcu(p, &head, list) {
    do_something_with(p->a, p->b, p->c);
  }
  rcu_read_unlock();

The list_add_rcu() primitive publishes an entry into the specified list, guaranteeing that the corresponding list_for_each_entry_rcu() invocation will properly subscribe to this same entry.

Quick Quiz 2: What prevents the list_for_each_entry_rcu() from getting a segfault if it happens to execute at exactly the same time as the list_add_rcu()?

Linux's other doubly linked list, the hlist, is a linear list, which means that it needs only one pointer for the header rather than the two required for the circular list. Thus, use of hlist can halve the memory consumption for the hash-bucket arrays of large hash tables.
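
For reference, the underlying structures look roughly like the following (simplified from the kernel's include/linux/list.h):

  struct list_head {
          struct list_head *next, *prev;
  };

  struct hlist_head {
          struct hlist_node *first;   /* single pointer: half the header size */
  };

  struct hlist_node {
          struct hlist_node *next, **pprev;
  };

The pprev back-pointer is what lets hlist deletion work without a doubly linked header, which in turn is what makes the one-pointer hlist_head possible.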

Linux hlist

Publishing a new element to an RCU-protected hlist is quite similar to doing so for the circular list:

  1 struct foo {
  2   struct hlist_node list;
  3   int a;
  4   int b;
  5   int c;
  6 };
  7 HLIST_HEAD(head);
  8
  9 /* . . . */
 10
 11 p = kmalloc(sizeof(*p), GFP_KERNEL);
 12 p->a = 1;
 13 p->b = 2;
 14 p->c = 3;
 15 hlist_add_head_rcu(&p->list, &head);

As before, line 15 must be protected by some sort of synchronization mechanism, for example, a lock.

Subscribing to an RCU-protected hlist is also similar to the circular list:

  rcu_read_lock();
  hlist_for_each_entry_rcu(p, q, &head, list) {
    do_something_with(p->a, p->b, p->c);
  }
  rcu_read_unlock();

Quick Quiz 3: Why do we need to pass two pointers into hlist_for_each_entry_rcu() when only one is needed for list_for_each_entry_rcu()?

The set of RCU publish and subscribe primitives is shown in the following table, along with additional primitives to "unpublish", or retract:

Category   Publish                          Retract                          Subscribe

Pointers   rcu_assign_pointer()             rcu_assign_pointer(..., NULL)    rcu_dereference()

Lists      list_add_rcu()                   list_del_rcu()                   list_for_each_entry_rcu()
           list_add_tail_rcu()
           list_replace_rcu()

Hlists     hlist_add_after_rcu()            hlist_del_rcu()                  hlist_for_each_entry_rcu()
           hlist_add_before_rcu()
           hlist_add_head_rcu()
           hlist_replace_rcu()

Note that the list_replace_rcu(), list_del_rcu(), hlist_replace_rcu(), and hlist_del_rcu() APIs add a complication. When is it safe to free up the data element that was replaced or removed? In particular, how can we possibly know when all the readers have released their references to that data element?

These questions are addressed in the following section.

Wait For Pre-Existing RCU Readers to Complete

In its most basic form, RCU is a way of waiting for things to finish. Of course, there are a great many other ways of waiting for things to finish, including reference counts, reader-writer locks, events, and so on. The great advantage of RCU is that it can wait for each of (say) 20,000 different things without having to explicitly track each and every one of them, and without having to worry about the performance degradation, scalability limitations, complex deadlock scenarios, and memory-leak hazards that are inherent in schemes using explicit tracking.

In RCU's case, the things waited on are called "RCU read-side critical sections". An RCU read-side critical section starts with an rcu_read_lock() primitive, and ends with a corresponding rcu_read_unlock() primitive. RCU read-side critical sections can be nested, and may contain pretty much any code, as long as that code does not explicitly block or sleep (although a special form of RCU called "SRCU" does permit general sleeping in SRCU read-side critical sections). If you abide by these conventions, you can use RCU to wait for any desired piece of code to complete.
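
As a brief illustration of these conventions (a sketch reusing the gp example from above), read-side critical sections may nest freely, and the code inside may do anything that does not block:

  rcu_read_lock();                      /* outer read-side critical section */
  p = rcu_dereference(gp);
  if (p != NULL) {
          rcu_read_lock();              /* nesting is permitted */
          do_something_with(p->a, p->b, p->c);
          rcu_read_unlock();
  }
  rcu_read_unlock();                    /* no blocking anywhere in between */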

RCU accomplishes this feat by indirectly determining when these other things have finished, as has been described elsewhere for RCU Classic and realtime RCU.

In particular, as shown in the following figure, RCU is a way of waiting for pre-existing RCU read-side critical sections to completely finish, including memory operations executed by those critical sections.

Grace periods extend to contain pre-existing RCU read-side critical sections.

However, note that RCU read-side critical sections that begin after the beginning of a given grace period can and will extend beyond the end of that grace period.

The following pseudocode shows the basic form of algorithms that use RCU to wait for readers:

  1. Make a change, for example, replace an element in a linked list.

  2. Wait for all pre-existing RCU read-side critical sections to completely finish (for example, by using the synchronize_rcu() primitive). The key observation here is that subsequent RCU read-side critical sections have no way to gain a reference to the newly removed element.

  3. Clean up, for example, free the element that was replaced above.

The following code fragment, adapted from those in the previous section, demonstrates this process, with field a being the search key:

  1 struct foo {
  2   struct list_head list;
  3   int a;
  4   int b;
  5   int c;
  6 };
  7 LIST_HEAD(head);
  8
  9 /* . . . */
 10
 11 p = search(head, key);
 12 if (p == NULL) {
 13   /* Take appropriate action, unlock, and return. */
 14 }
 15 q = kmalloc(sizeof(*p), GFP_KERNEL);
 16 *q = *p;
 17 q->b = 2;
 18 q->c = 3;
 19 list_replace_rcu(&p->list, &q->list);
 20 synchronize_rcu();
 21 kfree(p);

Lines 19, 20, and 21 implement the three steps called out above. Lines 16-19 give RCU ("read-copy update") its name: while permitting concurrent reads, line 16 copies and lines 17-19 do an update.

The synchronize_rcu() primitive might seem a bit mysterious at first. After all, it must wait for all RCU read-side critical sections to complete, and, as we saw earlier, the rcu_read_lock() and rcu_read_unlock() primitives that delimit RCU read-side critical sections don't even generate any code in non-CONFIG_PREEMPT kernels!

There is a trick, and the trick is that RCU Classic read-side critical sections delimited by rcu_read_lock() and rcu_read_unlock() are not permitted to block or sleep. Therefore, when a given CPU executes a context switch, we are guaranteed that any prior RCU read-side critical sections will have completed. This means that as soon as each CPU has executed at least one context switch, all prior RCU read-side critical sections are guaranteed to have completed, meaning that synchronize_rcu() can safely return.

Thus, RCU Classic's synchronize_rcu() can conceptually be as simple as the following:

  for_each_online_cpu(cpu)
    run_on(cpu);

Here, run_on() switches the current thread to the specified CPU, which forces a context switch on that CPU. The for_each_online_cpu() loop therefore forces a context switch on each CPU, thereby guaranteeing that all prior RCU read-side critical sections have completed, as required. Although this simple approach works for kernels in which preemption is disabled across RCU read-side critical sections, in other words, for non-CONFIG_PREEMPT and CONFIG_PREEMPT kernels, it does not work for CONFIG_PREEMPT_RT realtime (-rt) kernels. Therefore, realtime RCU uses a different approach based loosely on reference counters.

Of course, the actual implementation in the Linux kernel is much more complex, as it is required to handle interrupts, NMIs, CPU hotplug, and other hazards of production-capable kernels, while also maintaining good performance and scalability. Realtime implementations of RCU must additionally help provide good realtime response, which rules out implementations (like the simple two-liner above) that rely on disabling preemption.

Although it is good to know that there is a simple conceptual implementation of synchronize_rcu(), other questions remain. For example, what exactly do RCU readers see when traversing a concurrently updated list? This question is addressed in the following section.

Maintain Multiple Versions of Recently Updated Objects

This section demonstrates how RCU maintains multiple versions of lists to accommodate synchronization-free readers. Two examples are presented showing how an element that might be referenced by a given reader must remain intact while that reader remains in its RCU read-side critical section. The first example demonstrates deletion of a list element, and the second example demonstrates replacement of an element.

Example 1: Maintaining Multiple Versions During Deletion

To start the "deletion" example, we will modify lines 11-21 in the example in the previous section as follows:

  1 p = search(head, key);
  2 if (p != NULL) {
  3   list_del_rcu(&p->list);
  4   synchronize_rcu();
  5   kfree(p);
  6 }

The initial state of the list, including the pointer p, is as follows.

Initial list state.

The triples in each element represent the values of fields a, b, and c, respectively. The red borders on each element indicate that readers might be holding references to them, and because readers do not synchronize directly with updaters, readers might run concurrently with this entire deletion process. Please note that we have omitted the backwards pointers and the link from the tail of the list to the head for clarity.

After the list_del_rcu() on line 3 has completed, the 5,6,7 element has been removed from the list, as shown below. Since readers do not synchronize directly with updaters, readers might be concurrently scanning this list. These concurrent readers might or might not see the newly removed element, depending on timing. However, readers that were delayed (e.g., due to interrupts, ECC memory errors, or, in CONFIG_PREEMPT_RT kernels, preemption) just after fetching a pointer to the newly removed element might see the old version of the list for quite some time after the removal. Therefore, we now have two versions of the list, one with element 5,6,7 and one without. The border of the 5,6,7 element is still red, indicating that readers might be referencing it.

After deletion.

Please note that readers are not permitted to maintain references to element 5,6,7 after exiting from their RCU read-side critical sections. Therefore, once the synchronize_rcu() on line 4 completes, so that all pre-existing readers are guaranteed to have completed, there can be no more readers referencing this element, as indicated by its black border below. We are thus back to a single version of the list.

After deletion.

At this point, the 5,6,7 element may safely be freed, as shown below:

After deletion.

At this point, we have completed the deletion of element 5,6,7. The following section covers replacement.

Example 2: Maintaining Multiple Versions During Replacement

To start the replacement example, here are the last few lines of the example in the previous section:

  1 q = kmalloc(sizeof(*p), GFP_KERNEL);
  2 *q = *p;
  3 q->b = 2;
  4 q->c = 3;
  5 list_replace_rcu(&p->list, &q->list);
  6 synchronize_rcu();
  7 kfree(p);

The initial state of the list, including the pointer p, is the same as for the deletion example:

Initial list state.

As before, the triples in each element represent the values of fields a, b, and c, respectively. The red borders on each element indicate that readers might be holding references to them, and because readers do not synchronize directly with updaters, readers might run concurrently with this entire replacement process. Please note that we again omit the backwards pointers and the link from the tail of the list to the head for clarity.

Line 1 kmalloc()s a replacement element, as follows:

List state after allocation.

Line 2 copies the old element to the new one:

List state after copy.

Line 3 updates q->b to the value "2":

List state after update of b.

Line 4 updates q->c to the value "3":

List state after update of c.

Now, line 5 does the replacement, so that the new element is finally visible to readers. At this point, as shown below, we have two versions of the list. Pre-existing readers might see the 5,6,7 element, but new readers will instead see the 5,2,3 element. But any given reader is guaranteed to see some well-defined list.

List state after replacement.

After the synchronize_rcu() on line 6 returns, a grace period will have elapsed, and so all reads that started before the list_replace_rcu() will have completed. In particular, any readers that might have been holding references to the 5,6,7 element are guaranteed to have exited their RCU read-side critical sections, and are thus prohibited from continuing to hold a reference. Therefore, there can no longer be any readers holding references to the old element, as indicated by the thin black border around the 5,6,7 element below. As far as the readers are concerned, we are back to having a single version of the list, but with the new element in place of the old.

List state after grace period.

After the kfree() on line 7 completes, the list will appear as follows:

List state after grace period.

Despite the fact that RCU was named after the replacement case, the vast majority of RCU usage within the Linux kernel relies on the simple deletion case shown in the previous section.

Discussion

These examples assumed that a mutex was held across the entire update operation, which would mean that there could be at most two versions of the list active at a given time.

Quick Quiz 4: How would you modify the deletion example to permit more than two versions of the list to be active?

Quick Quiz 5: How many RCU versions of a given list can be active at any given time?

This sequence of events shows how RCU updates use multiple versions to safely carry out changes in the presence of concurrent readers. Of course, some algorithms cannot gracefully handle multiple versions. There are techniques [PDF] for adapting such algorithms to RCU, but these are beyond the scope of this article.

Conclusion

This article has described the three fundamental components of RCU-based algorithms:

  1. a publish-subscribe mechanism for adding new data,

  2. a way of waiting for pre-existing RCU readers to finish, and

  3. a discipline of maintaining multiple versions to permit change without harming or unduly delaying concurrent RCU readers.

Quick Quiz 6: How can RCU updaters possibly delay RCU readers, given that the rcu_read_lock() and rcu_read_unlock() primitives neither spin nor block?

These three RCU components allow data to be updated in the face of concurrent readers, and can be combined in different ways to implement a surprising variety of different types of RCU-based algorithms, some of which will be the topic of the next installment in this "What is RCU, Really?" series.

Acknowledgements

We are all indebted to Andy Whitcroft, Gautham Shenoy, and Mike Fulton, whose review of an early draft of this document greatly improved it. We owe thanks to the members of the Relativistic Programming project and to members of PNW TEC for many valuable discussions. We are grateful to Dan Frye for his support of this effort. Finally, this material is based upon work supported by the National Science Foundation under Grant No. CNS-0719851.

This work represents the view of the authors and does not necessarily represent the view of IBM or of Portland State University.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

Answers to Quick Quizzes

Quick Quiz 1: But doesn't seqlock also permit readers and updaters to get work done concurrently?

Answer: Yes and no. Although seqlock readers can run concurrently with seqlock writers, whenever this happens, the read_seqretry() primitive will force the reader to retry. This means that any work done by a seqlock reader running concurrently with a seqlock updater will be discarded and redone. So seqlock readers can run concurrently with updaters, but they cannot actually get any work done in this case.

In contrast, RCU readers can perform useful work even in the presence of concurrent RCU updaters.

Quick Quiz 2: What prevents the list_for_each_entry_rcu() from getting a segfault if it happens to execute at exactly the same time as the list_add_rcu()?

Answer: On all systems running Linux, loads from and stores to pointers are atomic, that is, if a store to a pointer occurs at the same time as a load from that same pointer, the load will return either the initial value or the value stored, never some bitwise mashup of the two. In addition, the list_for_each_entry_rcu() always proceeds forward through the list, never looking back. Therefore, the list_for_each_entry_rcu() will either see the element being added by list_add_rcu(), or it will not, but either way, it will see a valid well-formed list.

Back to Quick Quiz 2.

Quick Quiz 3: Why do we need to pass two pointers into hlist_for_each_entry_rcu() when only one is needed for list_for_each_entry_rcu()?

Answer: Because in an hlist it is necessary to check for NULL rather than for encountering the head. (Try coding up a single-pointer hlist_for_each_entry_rcu(). If you come up with a nice solution, it would be a very good thing!)
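
To see why the second cursor is needed, here is a rough sketch of what the two-pointer iteration expands to (illustrative only; the local names are hypothetical and the real macro differs in detail). The untyped struct hlist_node cursor is what gets tested against NULL, while the typed entry is derived from it:

  struct hlist_node *pos;
  struct foo *p;

  for (pos = rcu_dereference(head.first); pos != NULL;
       pos = rcu_dereference(pos->next)) {
          p = container_of(pos, struct foo, list);  /* recover the typed entry */
          do_something_with(p->a, p->b, p->c);
  }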

Back to Quick Quiz 3.

Quick Quiz 4: How would you modify the deletion example to permit more than two versions of the list to be active?

Answer: One way of accomplishing this is as follows:

  spin_lock(&mylock);
  p = search(head, key);
  if (p == NULL)
          spin_unlock(&mylock);
  else {
          list_del_rcu(&p->list);
          spin_unlock(&mylock);
          synchronize_rcu();
          kfree(p);
  }

Note that this means that multiple concurrent deletions might be waiting in synchronize_rcu().

Back to Quick Quiz 4.

Quick Quiz 5: How many RCU versions of a given list can be active at any given time?

Answer: That depends on the synchronization design. If a semaphore protecting the update is held across the grace period, then there can be at most two versions, the old and the new.

However, if only the search, the update, and the list_replace_rcu() were protected by a lock, then there could be an arbitrary number of versions active, limited only by memory and by how many updates could be completed within a grace period. But please note that data structures that are updated so frequently probably are not good candidates for RCU. That said, RCU can handle high update rates when necessary.

Back to Quick Quiz 5.

Quick Quiz 6: How can RCU updaters possibly delay RCU readers, given that the rcu_read_lock() and rcu_read_unlock() primitives neither spin nor block?

Answer: The modifications undertaken by a given RCU updater will cause the corresponding CPU to invalidate cache lines containing the data, forcing the CPUs running concurrent RCU readers to incur expensive cache misses. (Can you design an algorithm that changes a data structure without inflicting expensive cache misses on concurrent readers? On subsequent readers?)

Back to Quick Quiz 6.




example code nitpicks

Posted Dec 20, 2007 21:15 UTC (Thu) by ntl (subscriber, #40518)

Shouldn't the list_head and hlist_node structs be embedded in struct foo, instead of pointers? That is,
struct foo {
-   struct list_head *list;
+   struct list_head list;
    ...
}
Also, no need to cast the return value of kmalloc :)

example code nitpicks

Posted Dec 21, 2007 1:03 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624)

Good catch in both cases!

example code nitpicks

Posted Dec 21, 2007 17:08 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624)

And many thanks to Jon Corbet for applying the fixes for these problems!

Problem with ex 2 ?

Posted Dec 23, 2007 23:23 UTC (Sun) by xav (guest, #18536)

It looks to me that example 2 is wrong: access to p is unlocked, so it can change under its feet if preempted: *q = *p can access freed memory.

Problem with ex 2 ?

Posted Dec 27, 2007 18:57 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624)

Indeed!  As with the preceding example, there must be some sort of mutual exclusion in play, and as in the preceding example, this mutual exclusion is not shown explicitly. One approach, as you say, would be to add locking, perhaps similar to that described in the answer to Quick Quiz 4 (but for the deletion example).  Another approach would be to hold a mutex (in the old style, "semaphore") across the entire code segment.  Yet a third approach would be to permit only a single designated task to do updates (in which case the code would remain as is).  There are other approaches as well.

In any case, I am glad to see that people are sensitized to the need for mutual exclusion in parallel code!

What is RCU, Fundamentally?

Posted Dec 26, 2007 12:54 UTC (Wed) by union (guest, #36393)

Since nobody else said it: thanks for a great article. I liked the articles about memory and this looks like another great miniseries. While I spend most of my days programming Python and Java, I believe that understanding such lower-level concepts makes me a better programmer. I hope that there will be more such background or fundamentals articles in the future.

Luka

What is RCU, Fundamentally?

Posted Dec 27, 2007 19:04 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624)

Glad you liked it!!!  ;-)

Garbage-collected languages such as Java implement RCU's "wait for pre-existing RCU readers to complete" implicitly via the garbage collector.  However, in Java, you must make careful use of atomic variables in order to correctly implement the publish-subscribe mechanism.  Last I knew, Java would emit memory barriers for the subscribe operation, even when unnecessary, but perhaps that has changed by now.  However, in many cases, the memory-barrier overhead might well be acceptable (e.g., when avoiding contention).

So you might well be able to use RCU techniques in Java!  (For whatever it is worth, Kung and Lehman described the gist of using garbage collectors in this fashion in their paper entitled "Concurrent Maintenance of Binary Search Trees" in "ACM Transactions on Database Systems" back in September 1980.)

What is RCU, Fundamentally?

Posted Dec 27, 2007 23:47 UTC (Thu) by scottt (subscriber, #5028)

A link to the paper.

With people doing all these advanced concurrent algorithms in the eighties I wonder why we're only just now getting a formal memory model for the C/C++ programming languages.

What is RCU, Fundamentally?

Posted Dec 28, 2007 15:22 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624)

Thank you for providing the link -- much better than my old hard copy!  It is very good indeed to see some of these older papers becoming available on the web.

There was indeed some memory-consistency-model research done some decades ago.  Although I cannot claim comprehensive knowledge of memory-model research, as near as I can tell, almost all of this older research was swept aside by the notion of sequential consistency in 1979. The lion's share of later research assumed sequential consistency, which rendered this research less than helpful on weakly-ordered machines, where "weakly ordered" includes x86.  Algorithms that fail to work on x86 cannot be said to have much practical value, in my opinion.

There were nevertheless some fundamental papers published in the 90s (e.g., Sarita Adve and Kourosh Gharachorloo, among others), but many of the researchers focused on "how to emulate sequential consistency on weak-memory machines" as opposed to "how best to express memory-ordering constraints while allowing efficient code to be generated for systems with a wide variety of memory consistency models".  It is this latter statement that the C/C++ standards committee must address.

Backlinks

Posted Dec 28, 2007 12:37 UTC (Fri) by mmutz (guest, #5642)

I'm wondering whether the omission of the backlinks in the examples is a good thing. The omission makes the technique trivial, since publishing only involves replacing one pointer. What about the second, back, one? Without support for atomic two-pointer updates, how can both the p->prev->next = q and p->next->prev = q updates be performed without risking clients seeing an inconsistent view of the doubly linked list? Or is that not a problem in practice?

Thanks for the article, though. Looking forward to the next installment!

Backlinks

Posted Dec 28, 2007 15:35 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624)

Glad you liked the article, and thank you for the excellent question! I could give any number of answers, including:
  1. In production systems, trivial techniques are a very good thing.
  2. Show me an example where it is useful to traverse the ->prev pointers under RCU protection. Given several such examples, we could work out how best to support this.
  3. Consistency is grossly overrated. (Not everyone agrees with me on this, though!)
  4. Even with atomic two-pointer updates, consider the following sequence of events: (1) task 1 does p=p->next (2) task 2 inserts a new element between the two that task 1 just dealt with (3) task 1 does p=p->prev and fails to end up where it started! Even double-pointer atomic update fails to banish inconsistency! ;-)
  5. If you need consistency, use locks.

Given the example above, we could support a level of consistency equivalent to the double-pointer atomic update simply by assigning the pointers in sequence -- just remove the prev-pointer poisoning from list_del_rcu(), for example. But doing this would sacrifice the ability to catch those bugs that pointer-poisoning currently catches.

So, there might well come a time when the Linux kernel permits RCU-protected traversal of linked lists in both directions, but we need to see a compelling need for this before implementing it.

Forcing the compiler and CPU to execute assignment statements in order ?

Posted Dec 30, 2007 16:51 UTC (Sun) by Brenner (guest, #28232)

Thanks for the article.

In the 'Publish-Subscribe Mechanism', the article states: "Unfortunately, there is nothing forcing the compiler and CPU to execute the last four assignment statements in order." This breaks my mind. I understand concurrency problems with multiple tasks, but here only one task is considered (or did I misunderstand something?). Why would the CPU not execute the last four assignment statements in the order given by the compiler, and why would the compiler not keep the order given by the programmer??? Even with multiple tasks, I thought that the order was guaranteed (I agree that many things can happen between two code lines of one given task if other tasks are running though).

Can anyone enlighten me?

Forcing the compiler and CPU to execute assignment statements in order ?

Posted Jan 1, 2008 23:10 UTC (Tue) by PaulMcKenney (✭ supporter ✭, #9624)

You are welcome!

Your question about ordering is a good one, and code reordering by both CPU and compiler can be grossly counter-intuitive at times. So an explanation is indeed in order. Let's start with the compiler. The code was as follows:

 11 p->a = 1;
 12 p->b = 2;
 13 p->c = 3;
 14 gp = p;

It is possible that the compiler holds the value of gp in a register, but that it will need to spill that value in order to generate the code for lines 11-14. What to do? Well, as long as the compiler can prove that p and gp do not point to overlapping regions of memory, the compiler is within its rights to generate the code for line 14 before generating the code for the other lines. In this case, rearranging the code in this manner reduces code size, probably increases performance, and hurts no one. That is, it hurts no one in the absence of concurrency. However, please keep in mind that the C standard currently does not allow concurrency, strange though that may seem to those of us who have been writing concurrent C programs for decades.

So special non-standard compiler directives (barrier() in the Linux kernel, or, when used properly, volatile) are required to force the compiler to maintain ordering. Note that such directives are included, either directly or indirectly, in primitives that require ordering, for example, the smp_mb() memory barrier in the Linux kernel.

On to the CPU. Suppose that the compiler generates the code in order, and that the CPU executes it in order. What could possibly go wrong?

The store buffer. The CPU will likely commit the new values of p->a, p->b, p->c, and gp to the store buffer. If the cache line referenced by p happens to be owned by some other CPU (for example, if the memory returned by the preceding kmalloc() had been kfree()ed by some other CPU) and if the cache line containing gp is owned by the current CPU, the first three assignments will be flushed from the store buffer (and thus visible to other CPUs) long after the last assignment will be.

In addition, superscalar CPUs routinely execute code out of order. Such beasts might become less common as power-consumption concerns rise to the fore, but there are still quite a few such CPUs out there, dating from the mid-1990s for commodity microprocessors and to the mid-1960s for supercomputers. For example, Tomasulo's Algorithm, dating from the mid-1960s, is specifically designed to allow CPUs to execute instructions out of order. And there were super-scalar supercomputers pre-dating Tomasulo's Algorithm.

In short, if ordering is important to you, use primitives to enforce ordering. Such primitives include locking primitives (e.g., spin_lock() and spin_unlock() in the Linux kernel), the RCU publish-subscribe primitives called out in this article, and of course explicit memory barriers. (I would recommend avoiding explicit memory barriers where feasible -- they are difficult to get right.)

Hey, you asked!!! And Happy New Year!!!

Forcing the compiler and CPU to execute assignment statements in order ?

Posted Jan 2, 2008 0:09 UTC (Wed) by Brenner (guest, #28232)

Thanks a lot for your very informative answer. And Happy New Year!!!

Userland

Posted Jan 4, 2008 4:06 UTC (Fri) by eduardo.juan (guest, #47737)

Thanks for this excellent article! I have a fundamental question! :D Is there any way to use RCU in userland programs? I mean, any lib or implementation of these constraints in userland?

Thanks and regards!

Using RCU in userland

Posted Jan 4, 2008 18:13 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624)

RCU has been used experimentally in user-mode code, and one benchmarking framework may be found at U. of Toronto.

In addition, I vaguely recall at least one user-level RCU implementation being posted on LKML as a programming/debugging aid. There are some others that I am unfortunately unable to release at this time.

As noted in an earlier comment, garbage-collected languages automatically provide much of the wait-for-reader functionality for free -- however, it is still necessary to handle the publish-subscribe operation correctly. One way to do this in Java is to use the recently added "volatile" field specifier (no relation to "volatile" in C/C++) for the pointer that is being published. In other words, instead of using the Linux-kernel rcu_assign_pointer() and rcu_dereference() to publish and subscribe, you instead mark the pointer itself using Java's "volatile" keyword.

Using RCU in userland

Posted Jan 4, 2008 20:08 UTC (Fri) by eduardo.juan (guest, #47737)

Thanks Paul for answering! I'll be trying it!

Regards

Using RCU in userland

Posted Jan 5, 2008 16:26 UTC (Sat) by PaulMcKenney (✭ supporter ✭, #9624)

Very cool!  Please let me know how it goes!

Some usages of RCU in the kernel code

Posted Jan 6, 2008 10:42 UTC (Sun) by fbh (guest, #49754)

Hi,

I'm probably missing something, but I looked at users of RCU in the kernel and found that kernel/kprobes.c doesn't call the rcu_read_lock() function before entering a critical section. Does it have any good reasons to do so?

Another point in include/linux/list.h: sometimes it seems that the RCU API is not used, for no good reason. For example __list_add_rcu() does not use rcu_dereference(). However, by using it, the code could have clearly showed which pointer is RCU protected...

The Alpha memory ordering and value-speculation compiler optimizations done in this architecture sound incredibly weird to me. Any available pointers on this subject?

Anyways, thanks for this article.

Some usages of RCU in the kernel code

Posted Jan 6, 2008 19:04 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624)

The reason that kernel/kprobes.c does not use synchronize_rcu() is that it instead uses synchronize_sched(). The read-side primitives corresponding to synchronize_sched() include preempt_disable()/preempt_enable() and local_irq_save()/local_irq_restore() -- in short, any primitive that disables and re-enables preemption, including entry to hardware irq/exception handlers. The third article in this series will cover these and other nuances of the RCU API.
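
As a rough illustration of this pairing (a sketch only, reusing the gp example from the article rather than the actual kprobes code), the read side relies on disabling preemption rather than on rcu_read_lock():

  /* Read side: disabled preemption takes the place of rcu_read_lock(). */
  preempt_disable();
  p = rcu_dereference(gp);          /* the subscribe step is still needed */
  if (p != NULL)
          do_something_with(p->a, p->b, p->c);
  preempt_enable();

  /* Update side (assuming a single updater for simplicity). */
  old = gp;
  rcu_assign_pointer(gp, NULL);
  synchronize_sched();              /* waits for all preemption-disabled regions */
  kfree(old);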

The reason that __list_add_rcu() does not use rcu_dereference() is that __list_add_rcu() is an update-side primitive, and therefore must be protected by appropriate locking. This means that the list cannot change while the lock is held, and this in turn means that the memory barriers implied by lock acquisition suffice. That said, it is permissible (but again, not necessary) to use rcu_dereference() in update-side code -- in fact, in some situations, doing so promotes code reuse and in some cases readability.

The discussion of Figure 2 in this article and its bibliography is a good place to start on Alpha's memory ordering. I am not all that familiar with value-speculation compiler-optimization research, but one place to start might be here.

Some usages of RCU in the kernel code

Posted Jan 7, 2008 10:14 UTC (Mon) by fbh (guest, #49754)

Regarding the second point, I'm sorry, my fingers typed rcu_dereference() whereas my brain was thinking of rcu_assign_pointer(). rcu_dereference() could be used in update-side code, as you pointed out, but obviously not in __list_add_rcu(). Therefore __list_add_rcu() could be:

        new->next = next;
        new->prev = prev;
        rcu_assign_pointer(prev->next, new);
        next->prev = new;

The main advantage of this is readability IMHO.

You said that __list_add_rcu() must be protected by appropriate locking, and a memory barrier is implied by lock acquisition which should be enough. But __list_add_rcu() has a memory barrier in its implementation.

Thanks.

Some usages of RCU in the kernel code

Posted Jan 7, 2008 15:28 UTC (Mon) by PaulMcKenney (✭ supporter ✭, #9624)

Good point!  Would you like to submit a patch?

Some usages of RCU in the kernel code

Posted Jan 7, 2008 16:03 UTC (Mon) by fbh (guest, #49754)

Well, I actually raised 2 points and I'm not sure which one is the good one... Could you please clarify?

To answer your question, I don't have any problem with submitting a patch.

---
PS: This interface is pretty hard for discussing an article. It's like bottom-posting except we lose all its advantages... Just my 2 cents.

Some usages of RCU in the kernel code

Posted Jan 7, 2008 16:52 UTC (Mon) by PaulMcKenney (✭ supporter ✭, #9624)

I was inviting a patch for incorporating the smp_wmb() into the rcu_assign_pointer. The lock acquisition makes rcu_dereference() unnecessary, but cannot enforce the ordering provided by the smp_wmb()/rcu_assign_pointer().

Some usages of RCU in the kernel code

Posted Jan 8, 2008 11:02 UTC (Tue) by jarkao2 (guest, #41960)

Thanks for this (next) great article too! BTW, probably 'a' tiny fix needed:

"Maintain Multiple Versions of Recently Updated Objects [...] Two examples are presented showing how a an element [...]"

Some usages of RCU in the kernel code

Posted Jan 8, 2008 14:13 UTC (Tue) by PaulMcKenney (✭ supporter ✭, #9624)

Good eyes!!!

What is RCU, Fundamentally?

Posted Aug 6, 2011 10:47 UTC (Sat) by stock (guest, #5849)

"One key attribute of RCU is the ability to safely scan data, even
though that data is being modified concurrently. To provide this
ability for concurrent insertion, RCU uses what can be thought of
as a publish-subscribe mechanism."

Does this not read like that once announced Microsoft Filesystem called
WinFS ? From the WinFS wikipedia penal records :

http://en.wikipedia.org/wiki/WinFS

"WinFS (short for Windows Future Storage)[1] is the code name for a
cancelled[2] data storage and management system project based on
relational databases, developed by Microsoft and first demonstrated
in 2003 as an advanced storage subsystem for the Microsoft Windows
operating system, ...
[ ... ]
WinFS includes a relational database for storage of information,
and allows any type of information to be stored in it, provided
there is a well defined schema for the type."

It's funny to see IBM, amongst others, explain the 'dirty' details about 'RCU'
here on LWN.

Robert
--
Robert M. Stockmann - RHCE
Network Engineer - UNIX/Linux Specialist
crashrecovery.org stock@stokkie.net

What is RCU, Fundamentally?

Posted Aug 6, 2011 13:51 UTC (Sat) by nix (subscriber, #2304)

Does this not read like that once announced Microsoft Filesystem called WinFS ?
No. WinFS was a 'store everything in a relational database' thing. This is utterly different, nothing relational at all.

What is RCU, Fundamentally?

Posted May 24, 2015 12:50 UTC (Sun) by firolwn (guest, #96711)

I really didn't understand Quick Quiz 5.
Why does a semaphore affect the number of RCU versions of the list?

What is RCU, Fundamentally?

Posted Jun 8, 2015 18:32 UTC (Mon) by PaulMcKenney (✭ supporter ✭, #9624)

In the following code, we have at most two versions:
  mutex_lock(&my_mutex);
  p = find_an_element(key);
  list_del_rcu(&p->list);
  synchronize_rcu();
  kfree(p);
  mutex_unlock(&my_mutex);
Only one task at a time may hold my_mutex, so there can be at most two versions of the list, the new one with the element deleted, and for pre-existing readers, the old one with that element still in place. (When this article was written, the Linux kernel used semaphores for sleeplocks, but it now uses mutexes.)

In contrast, suppose that the mutex is held only across the deletion:

  mutex_lock(&my_mutex);
  p = find_an_element(key);
  list_del_rcu(&p->list);
  mutex_unlock(&my_mutex);
  synchronize_rcu();
  kfree(p);
In this case, many updaters could be waiting for grace periods in parallel, so there could be many concurrent versions of the list.

What is RCU, Fundamentally?

Posted Jun 3, 2024 6:57 UTC (Mon) by summergift (guest, #171661)

Thank you for your explanation!

Links to later articles in the series

Posted Sep 1, 2024 8:23 UTC (Sun) by fraetor (subscriber, #161147)

Links to the other two parts of the series:


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds

