Introduce intrusive split-list and compare its scanning performance with some other collections #3138
Conversation
xemul commented Dec 8, 2025 (edited)
TODO list for this PR
Force-pushed from a8fcbd6 to 8acacc4

xemul commented Dec 8, 2025
upd:
xemul commented Dec 8, 2025
split-list main loop decoded
```cpp
        bi::cache_last<true>,
        bi::member_hook<element, bi::slist_member_hook<>, &element::l_next>>;
static_assert(sizeof(slist) <= 2 * sizeof(void*));
slist _singly_linked_list;
```
Curious if the intrusive list was what you reached for first but found the performance overhead to be too much? We had encountered this with a doubly-linked intrusive list (even with all the safe/auto stuff turned off), which cost something like a 10x performance reduction compared to a hand-coded simple version of effectively the same thing. Not sure if slist has the same issues.
Yes, I first played with an intrusive list in #3125 and saw that it's several times worse than circular_buffer<T*>. And then came this split-list thing.
```cpp
}
T* pop_front() noexcept {
    auto i = _head_idx++ % F;
```
I wonder if there's any performance gain to be had by limiting F to be a power of two, and using & instead of %. chunked_fifo does this, and has

```cpp
static_assert((items_per_chunk & (items_per_chunk - 1)) == 0, "chunked_fifo chunk size must be power of two");
```

Alternatively, maybe the optimizer will already convert "%" to "&" if it's a power of two? Did you try your benchmark with different choices of F, power of two or not?
Yes, I did. When it's a power of two, the compiler (in a release build) uses &: #3138 (comment)
If this has a noticeable impact on performance, I think a comment recommending a power-of-two F is in order. If the impact is very noticeable, perhaps it makes sense to just force it, with a static_assert?
Yes, using a power-of-two F is very welcome.

clang-20

| F | instructions/element | cycles/element |
|---|---|---|
| 17 | 15.5 | 6.6 |
| 16 | 8.5 | 4.8 |
| 15 | 14.5 | 6.2 |

GCC-15

| F | instructions/element | cycles/element |
|---|---|---|
| 17 | 16.0 | 6.0 |
| 16 | 10.0 | 5.1 |
| 15 | 16.0 | 6.7 |
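To make the & trick concrete, here is a small standalone check (my example, not PR code): for unsigned x and a power-of-two F, x % F equals x & (F - 1), so the compiler can replace the division with a single AND.

```cpp
#include <cassert>

// Hypothetical helper illustrating the identity discussed above.
unsigned mod_pow2(unsigned x, unsigned f) {
    assert(f != 0 && (f & (f - 1)) == 0); // f must be a power of two
    return x & (f - 1);                   // same value as x % f for unsigned x
}

int main() {
    for (unsigned x = 0; x < 1000; x++) {
        assert(mod_pow2(x, 16) == x % 16); // F = 16, the sweet spot in the tables above
    }
}
```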
```cpp
/*
 * Copyright (C) 2025 ScyllaDB
 */
```
This file desperately needs a fairly long comment explaining the purpose of this new container, some overview of how it works, and a comparison to other similar containers (list, deque, circular_buffer, chunked_fifo, boost::intrusive::slist) helping a reader understand when to choose this one over the alternatives.

See the example in chunked_fifo.hh.

By the way, any reason why the older chunked_fifo.hh should be in core/ but this new one is in util/?
Yes, absolutely. The "add documentation" item is in the TODO list.

> By the way, any reason why the older chunked_fifo.hh should be in core/ but this new one is in util/?

No specific reason, I just thought it was an auxiliary container, not a "core" one, whatever that means.
I think chunked_fifo.hh and circular_buffer.hh predate the separate directory for all the include files (include/seastar/), so they were in core/ because they were used in the core's implementation -- and not really intended to be used by end-users. The placement in include/core is not because the header file is any more "core" or important for end-users than other header files.

In that sense, the new list is also used in the core's implementation, so it could be in core/ instead of util/ -- like its other friends, chunked_fifo and circular_buffer.

But maybe this discussion is actually a sign that when we created the new include/ directory, we shouldn't have kept the distinct "core" vs "util" subdirectories :-(
nyh commented Dec 10, 2025
This is a really interesting and clever implementation. I think its benefit over regular intrusive singly-linked lists is closely tied to how prefetching works on contemporary CPUs -- that it actually prefetches data pointed to by any pointer it sees, so it is helpful to let it see a few (F) pointers in advance instead of just the next one.

I wonder how much of this advantage remains if the scanning code does something more than just advance, and actually does a bit more work on each item. In this case the prefetching of F items in advance might be too early -- we still have a lot of work to do before we're ready to need the Fth next item. Also, in this case the speedup of just the scan might be negligible compared to the real amount of work we need to do on the items.

Do you know if something like your "split list" was ever done before? I searched a bit and got to a concept called a "chain" for each of the individual linked lists, and "multi-chain" for the idea of having multiple independent lists, each allowing one prefetch -- but most discussions I read about these ideas were about how to write a better prefetcher, not how to write a linked list that works better with the existing prefetcher. Maybe you can write a paper -- or at least a blog post -- on this new (?) technique :-)

You mentioned that you were looking for the ability to "push_back and push_front" -- does your intended use case (run-queue) actually use both? chunked_fifo<> doesn't do push_front, but this was a decision to simplify the API -- it's not that it was algorithmically impossible to implement.

Of course, if your goal is to be an intrusive collection, then obviously chunked_fifo is not what you're looking for and you needed to create a new type. But in that case I think the word intrusive should be part of the new collection's name, and part of the pull request's title. If, on the other hand, "intrusive" was not very important to you -- it appears that the existing chunked_fifo is actually a tiny bit more efficient than the new split_list, right?
xemul commented Dec 10, 2025
It's a perfectly valid concern. I will be able to answer it more reliably after I rebase #3125 onto this new container and see a more real-life-ish example of task queue processing. There's a perf test for run-queue scan in that PR too. It runs no-op tasks, so it's still a simplification.

I'm pretty sure that someone did that. At least, Redpanda says that they coded something similar. I also tried to find more info about this Avi's concern, and also found some "related" discussions, but overall no, I didn't find any deep research on that topic.

I already did :) I'll send you a link.

Yes.

Yes, definitely. Renaming the collection is in the TODO list as well.

Yes, that's correct. And chunked_fifo is actually faster than split-list, so we have a choice of three options.

(And a reminder, to have some reference number for those "cycles" numbers -- a plain intrusive list takes 25.8 cycles per element; all numbers are for 10k elements in the collection.)
xemul commented Dec 10, 2025
upd:
Force-pushed from 8acacc4 to af69879

xemul commented Dec 10, 2025
upd:
xemul commented Dec 10, 2025

Documentation is yet to be written 😕
avikivity commented Dec 10, 2025

Looks like chunked_fifo wins. And again I need to re-measure https://github.com/avikivity/seastar/commits/sort-tasks/, I found some bugs in it.
xemul commented Dec 10, 2025

Looks like it doesn't :) I measured on an i7i.2xlarge instance with a Xeon processor -- split-list 2.1 vs chunked-fifo 2.4.
xemul commented Dec 10, 2025

I updated the PR description with the full table.
xemul commented Dec 10, 2025

Added (to the PR description) results for Threadripper, but with a different compiler (clang-20). Split-list 4.9 vs chunked-fifo 6.9.
The split list is a collection of objects that's optimized for scanning from begin to end, with the ability to add entries at two ends -- front and back, like std::deque does.

Splitting is needed to let the CPU pre-fetch the next element while scanning the current one. With classical linked lists this is difficult, because in order to prefetch the next element the CPU needs to read the next pointer from the current element, which is itself still being read. Thus the forward scan of a plain list is serialized. With split lists, several adjacent elements can be prefetched in parallel.

To achieve that, the "first" pointer is "sharded" -- there are F first pointers (as well as F cached last pointers) and the list is populated in round-robin manner -- first the 0th list is appended, then the 1st, then the 2nd, ... then the (F-1)th, and population wraps around and starts over.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
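For readers skimming the thread, here is a minimal sketch of the structure the commit message describes -- F sharded first pointers, F cached last pointers, round-robin population. It is my reconstruction from the quoted fragments, not the actual PR code, and it assumes a power-of-two F (see the wrap-around discussion further down).

```cpp
#include <array>
#include <cstddef>

template <typename T, unsigned F, T* T::*Next>
class split_list_sketch {
    static_assert((F & (F - 1)) == 0, "F must be a power of two");
    std::array<T*, F> _first{};  // F "sharded" first pointers, packed tightly
    std::array<T**, F> _last_p;  // F cached last pointers (addresses of next slots)
    unsigned _tail_idx = 0;
    unsigned _head_idx = 0;
public:
    split_list_sketch() noexcept {
        for (unsigned i = 0; i < F; i++) {
            _last_p[i] = &_first[i]; // empty sub-list: last pointer aims at its head slot
        }
    }
    void push_back(T& e) noexcept {
        auto i = _tail_idx++ & (F - 1); // round-robin over the F sub-lists
        e.*Next = nullptr;
        *_last_p[i] = &e;
        _last_p[i] = &(e.*Next);
    }
    T* pop_front() noexcept {
        auto i = _head_idx++ & (F - 1);
        T* e = _first[i];
        _first[i] = e->*Next;
        if (_first[i] == nullptr) {
            _last_p[i] = &_first[i]; // sub-list drained, re-arm its last pointer
        }
        return e;
    }
    std::size_t size() const noexcept { return _tail_idx - _head_idx; }
};

// Usage sketch: an intrusive element carries its own next pointer.
struct item { item* next; long value; };
using item_list = split_list_sketch<item, 16, &item::next>;
```

Because consecutive pushes land in consecutive sub-lists, a forward scan can chase F independent next pointers, which is what lets the CPU prefetch several elements ahead.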
Add the `--parameter name=value` vector option. Tests may then do `perf_tests::get_parameter("foo")` to get the user-configured parameter and update themselves accordingly.

It will be useful for future scanning tests, as they show very different results on collections of different sizes, and hard-coding a test for each interesting collection size is too much code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

This test allocates a vector of elements containing an integer, and measures the time it takes to scan the collection sequentially. Elements are artificially enlarged to occupy 64 bytes, so that this collection doesn't benefit from caching several values on read.

This is the basic reference test for further testing (next patches).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extend the vector<element> scanning test with a vector<element*> scanning one. The pointers are randomized.

This test works almost as fast as the base reference test does, because at most one element fits a cache line, so CPU caching is effectively off here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extend the test with a boost::intrusive_list<element> scan. This is to show that such scanning, despite occupying less memory, works _slower_ than the base reference test. This is because the CPU cannot prefetch adjacent elements in parallel -- the next pointer needs to be read anyway.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use the split-list with a factor of 16, as it shows the nicest results on my CPU (Ryzen Threadripper 2970WX).

This test beats the singly-linked list greatly, and works ~2x slower on 10k elements than the array of pointers does. But unlike the array of pointers, it doesn't require memory allocation on append operations.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Here the fifo stores pointers to elements, to behave like the array of pointers does, yet avoiding large (re)allocations on append.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
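As a usage illustration of this commit's approach (an assumed shape -- the actual test code is in the PR, and I'm assuming chunked_fifo's forward iterators here): a chunked_fifo of pointers scans like the array of pointers does, while only ever allocating fixed-size chunks.

```cpp
#include <seastar/core/chunked_fifo.hh>

struct element {
    long value;
    char padding[56]; // enlarge to 64 bytes, as the reference test does
};

// Hypothetical scan mirroring the benchmark's access pattern: walk the
// fifo front-to-back, touching each pointed-to element once.
long scan_sum(seastar::chunked_fifo<element*, 128>& q) {
    long sum = 0;
    for (auto it = q.begin(); it != q.end(); ++it) {
        sum += (*it)->value;
    }
    return sum;
}
```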
Force-pushed from af69879 to ed08c21

xemul commented Dec 10, 2025
upd:
```cpp
}
size_t size() const noexcept {
    return _tail_idx - _head_idx;
```
I just realized that in your implementation _tail_idx and _head_idx can keep growing as the queue is written and read over a long period of time, and wrap around the 32-bit unsigned. This is fine, almost, except for one thing: the "%" operation doesn't wrap around as expected when F is not a power of two. For example, consider F=5.
If _tail_idx reaches the last unsigned integer, _tail_idx=4294967295, and _tail_idx%5 = 0. Then we wrap around to _tail_idx=0, and _tail_idx%5 = 0 again!
So I think you need the static_assert that F is a power of two - not just for performance but also for correctness.
Wow, what a catch! Thanks!

I assume circular_buffer rounds up its size to a power of two for the same reason.
By the way, with push_front setting head_idx to -1 immediately, it might be easy to write a test that reproduces this bug when F is not a power of two.
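Along those lines, a standalone demonstration of the arithmetic (my example, not PR code): with F=5 the index-to-sub-list mapping repeats across the 32-bit wrap, while a power-of-two F advances consistently.

```cpp
#include <cstdio>

int main() {
    unsigned idx = 4294967295u; // UINT_MAX: what _head_idx becomes right after
                                // a push_front() on a fresh list, per the comment above
    std::printf("%u %u\n", idx % 5, idx % 4); // prints "0 3"
    idx++;                                    // wraps around to 0
    std::printf("%u %u\n", idx % 5, idx % 4); // prints "0 0": F=5 lands on sub-list 0
                                              // twice in a row; F=4 advances 3 -> 0 correctly
}
```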
```cpp
std::array<T*, F> _first;
std::array<T**, F> _last_p;
unsigned _tail_idx;
unsigned _head_idx;
```
This "unsigned" limits the the number of elements in the list to 2^32 - and pushing more than this will, if I understand correctly, fail silently. I don't think this limit will bother us, but perhaps it should be documented at least.
```cpp
public:
    iterator(unsigned idx) noexcept
        : _cur({nullptr})
```
By the way, I think the {nullptr} here is confusing about what it really does. I think (but correct me if I'm wrong) that it only initializes the first entry to nullptr explicitly, and then value-initializes the other ones to nullptr as well. So I think just {} would do exactly the same thing.
Yes, it initializes the zeroth with nullptr and the rest with defaults (nullptrs as well).

It's an old C way of zeroing out an array.

I'll change it to {}
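A tiny illustration of the point (standalone, not PR code): for std::array, aggregate-initializing with {nullptr} sets the first element explicitly and value-initializes the rest, so {} produces the same all-nullptr result.

```cpp
#include <array>
#include <cassert>

int main() {
    std::array<int*, 4> a{nullptr}; // element 0 explicit, the rest value-initialized
    std::array<int*, 4> b{};        // all elements value-initialized
    for (unsigned i = 0; i < 4; i++) {
        assert(a[i] == nullptr && b[i] == nullptr); // identical contents
    }
}
```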
mitso23 left a comment (edited)
Looking at the performance metrics, it appears that split_list and chunked_fifo show very similar performance - with chunked_fifo actually being slightly faster in some configurations (e.g., AMD Ryzen + GCC-15).
Would it be worth considering chunked_fifo as a first iteration? It's already well-established in Seastar and might solve the performance problem you're targeting without introducing a new data structure.
Additionally, it would be helpful to include circular_buffer in the benchmarks, since that's the current implementation for the scheduler run-queue. This would provide a clearer picture of the actual improvement.
I assume the main concern with circular_buffer is the potential reallocation when it reaches max capacity - is that correct?
xemul commentedDec 11, 2025
It's better to discuss it in#3125 , because that's the place where the new collection is actually used. This PR just adds some benchmarks. However ...
Not only. It's other main concern is that it allocates at all, thus making task wakeup potentially failing operation. However, I agree, that the problem is amortized, the buffer is never shrank back, so once it passes a wake-up storm, it survives until next, larger tsunami.
Agree. Actually, the purpose of this test was to show the difference in handling different memory layouts, and circular_buffer is layed out as vector. But I think that testing differentimplementations of the same layout is worthwhile. |
avikivity commented Dec 11, 2025

My feeling is that this thing is more sensitive to different workloads than chunked_fifo. A run with very small+fast tasks will prefer a large F, so that prefetch can happen in time. I believe the second effect is much less important, so we'd choose a reasonably large F (say, 8 -- what did you measure with?).

A different objection is that I'd like to pursue my sort-tasks thing, which un-FIFOs the task queue. But it needs more work.
avikivity commented Dec 11, 2025

The tasks themselves are allocated; if we can allocate a task, we can allocate a chunk for chunked_fifo.
xemul commented Dec 11, 2025

16
```cpp
class intrusive_split_list {
    std::array<T*, F> _first;
    std::array<T**, F> _last_p;
    unsigned _tail_idx;
```
Why not use a 64-bit unsigned int? Then you don't really need to worry about the tail wrapping.
The head wraps anyway (immediately after a push_front() that happens before any pop_front()). And 32 bits is more than enough for typical use-cases: a seastar shard may have at most 64GB, so a 32-bit index is enough for 16-byte objects (2^32 × 16B = 64GB). I don't think it's a problem.
```cpp
template <typename T, unsigned F, T* T::*Member>
class intrusive_split_list {
    std::array<T*, F> _first;
    std::array<T**, F> _last_p;
```
Wouldn't it be more readable to use dummy nodes as the heads of the linked lists? You wouldn't need to use T** then.
T can be an abstract type, like class task.
I see
Also, it's crucial to have the first pointers packed as tightly as possible to be cached on read; with dummy nodes it's not going to work as nicely.
```cpp
    }
};

PERF_TEST_F(sum_perf, sum_plain) {
```
Out of curiosity, why are you not using Google Benchmark? What are the benefits of using the seastar benchmark?
Just because there's a seastar perf test suite in the repo :)
```cpp
// checks that forward scanning reveals consistent results
BOOST_AUTO_TEST_CASE(test_split_list_grow_and_shrink) {
    const auto seed = std::random_device()();
    auto rnd_engine = std::mt19937(seed);
```
You should also test the iterator logic, and maybe the wrap-around logic if you want to use 32-bit unsigned integers for head/tail.
Iterator logic is checked with

```cpp
} else if (op == 2) { // scan
    unsigned scanned = 0;
    fmt::print("= {} ... {}\n", first, last);
    for (auto it = sl.begin(); it != sl.end(); ++it) {
        fmt::print("  {}\n", it->value);
        BOOST_REQUIRE_EQUAL(it->value, first + 1 + scanned);
        scanned++;
    }
    BOOST_REQUIRE_EQUAL(scanned, last - first);
```
The purpose of this PR is to demonstrate how different "queues" (see below) perform in a full-scan scenario.

Here a "queue" is a collection of objects that is scanned from begin to end and takes new entries at both ends -- front and back (see the commit messages below).

Initially such a "queue" was encountered in the CPU scheduler run-queue. In it, the stored element is task* and the actual elements are classes that inherit from class task. The run-queue is implemented as a circular_buffer<task*> container.

Scan performance comparison of different collections on 10k elements:
On an AMD Ryzen Threadripper 2970WX processor

GCC-15.2.1 (fastest to slowest): vector<T>, vector<T*>, chunked_fifo<T*, 128>, split_list<T, 16>, intrusive_list<T>

Clang-20.1.8 (fastest to slowest): vector<T>, vector<T*>, split_list<T, 16>, chunked_fifo<T*, 128>, intrusive_list<T>

On an Intel i7-10750H processor

GCC-13.3.0 (fastest to slowest): vector<T>, vector<T*>, split_list<T, 16>, chunked_fifo<T*, 128>, intrusive_list<T>

On an Intel(R) Xeon(R) Platinum 8559C processor (i7i.2xlarge instance)

Clang-20.1.2 (fastest to slowest): vector<T>, vector<T*>, split_list<T, 16>, chunked_fifo<T*, 128>, intrusive_list<T>