Introduce intrusive split-list and compare its scanning performance with some other collections #3138
Conversation
xemul commented Dec 8, 2025 (edited)
TODO list for this PR
Force-pushed from a8fcbd6 to 8acacc4

xemul commented Dec 8, 2025
upd:
xemul commented Dec 8, 2025
split-list main loop decoded
```cpp
        bi::cache_last<true>,
        bi::member_hook<element, bi::slist_member_hook<>, &element::l_next>>;
static_assert(sizeof(slist) <= 2 * sizeof(void*));
slist _singly_linked_list;
```
Curious if the intrusive list was what you reached for first but found the performance overhead to be too much? We had encountered this with a doubly-linked intrusive list (even with all the safe/auto stuff turned off), which cost something like a 10x performance reduction compared to a hand-coded simple version of effectively the same thing. Not sure if slist has the same issues.
Yes, I first played with an intrusive list in #3125 and saw that it's several times worse than circular_buffer<T*>. And then came this split-list thing.
```cpp
}
T* pop_front() noexcept {
    auto i = _head_idx++ % F;
```
I wonder if there's any performance gain to be had by limiting F to be a power of two, and using & instead of %. chunked_fifo does this, and has

```cpp
static_assert((items_per_chunk & (items_per_chunk - 1)) == 0, "chunked_fifo chunk size must be power of two");
```

Alternatively, maybe the optimizer will already convert "%" to "&" if it's a power of two? Did you try your benchmark with different choices of F, power of two or not?
Yes, I did. When it's a power of two, the compiler (in a release build) uses &: #3138 (comment)
If this has a noticeable impact on performance, I think a comment recommending a power-of-two F is in order. If the impact is very noticeable, perhaps it makes sense to just force it, with a static_assert?
Yes, using a power-of-two F is very welcome.

clang-20

| F | instructions/element | cycles/element |
|---|---|---|
| 17 | 15.5 | 6.6 |
| 16 | 8.5 | 4.8 |
| 15 | 14.5 | 6.2 |

GCC-15

| F | instructions/element | cycles/element |
|---|---|---|
| 17 | 16.0 | 6.0 |
| 16 | 10.0 | 5.1 |
| 15 | 16.0 | 6.7 |
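To make the & trick concrete, here is a small standalone check (my example, not PR code): for unsigned x and a power-of-two F, x % F equals x & (F - 1), so the compiler can replace the division with a single AND.

```cpp
#include <cassert>

// Hypothetical helper illustrating the identity discussed above.
unsigned mod_pow2(unsigned x, unsigned f) {
    assert(f != 0 && (f & (f - 1)) == 0); // f must be a power of two
    return x & (f - 1);                   // same value as x % f for unsigned x
}

int main() {
    for (unsigned x = 0; x < 1000; x++) {
        assert(mod_pow2(x, 16) == x % 16); // F = 16, the sweet spot in the tables above
    }
}
```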
```cpp
/*
 * Copyright (C) 2025 ScyllaDB
 */
```
This file desperately needs a fairly long comment explaining the purpose of this new container, some overview of how it works, and a comparison to other similar containers (list, deque, circular_buffer, chunked_fifo, boost::intrusive::slist) helping a reader understand when to choose this one over the alternatives.

See the example in chunked_fifo.hh.

By the way, any reason why the older chunked_fifo.hh should be in core/ but this new one is in util/?
Yes, absolutely. The "add documentation" item is in the TODO list.

> By the way, any reason why the older chunked_fifo.hh should be in core/ but this new one is in util/?

No specific reason, I just thought it was an auxiliary container, not a "core" one, whatever that means.
I think chunked_fifo.hh and circular_buffer.hh predate the separate directory for all the include files (include/seastar/), so they were in core/ because they were used in the core's implementation -- and not really intended to be used by end-users. The placement in include/core is not because the header file is any more "core" or important for end-users than other header files.

In that sense, the new list is also used in the core's implementation, so it could be in core/ instead of util/ -- like its other friends, chunked_fifo and circular_buffer.

But maybe this discussion is actually a sign that when we created the new include/ directory, we shouldn't have kept the distinct "core" vs "util" subdirectories :-(
nyh commented Dec 10, 2025
This is a really interesting and clever implementation. I think its benefit over regular intrusive singly-linked lists is closely tied to how prefetching works on contemporary CPUs -- that it actually prefetches data pointed to by any pointer it sees, so it is helpful to let it see a few (F) pointers in advance instead of just the next one.

I wonder how much of this advantage remains if the scanning code does something more than just advance, and actually does a bit more work on each item. In this case the prefetching of F items in advance might be too early -- we still have a lot of work to do before we're ready to need the Fth next item. Also, in this case the speedup of just the scan might be negligible compared to the real amount of work we need to do on the items.

Do you know if something like your "split list" was ever done before? I searched a bit and got to a concept called a "chain" for each of the individual linked lists, and "multi-chain" for the idea of having multiple independent lists, each allowing one prefetch -- but most discussions I read about these ideas were about how to write a better prefetcher, not how to write a linked list that works better with the existing prefetcher. Maybe you can write a paper -- or at least a blog post -- on this new (?) technique :-)

You mentioned that you were looking for the ability to "push_back and push_front" -- does your intended use case (run-queue) actually use both? chunked_fifo<> doesn't do push_front, but this was a decision to simplify the API -- it's not that it was algorithmically impossible to implement.

Of course, if your goal is to be an intrusive collection, then obviously chunked_fifo is not what you're looking for and you needed to create a new type. But in that case I think the word intrusive should be part of the new collection's name, and part of the pull request's title. If, on the other hand, "intrusive" was not very important to you -- it appears that the existing chunked_fifo is actually a tiny bit more efficient than the new split_list, right?
xemul commented Dec 10, 2025
It's a perfectly valid concern. I will be able to answer it more reliably after I rebase #3125 onto this new container and see a more real-life-ish example of task queue processing. There's a perf test for run-queue scan in that PR too. It runs no-op tasks, so it's still a simplification.

I'm pretty sure that someone did that. At least, Redpanda says that they coded something similar. I also tried to find more info about this Avi's concern, and also found some "related" discussions, but overall no, I didn't find any deep research on that topic.

I already did :) I'll send you a link.

Yes.

Yes, definitely. Renaming the collection is in the TODO list as well.

Yes, that's correct. And chunked_fifo is actually faster than split-list, so we have a choice of three options.

(And a reminder, to have some reference number for those "cycles" numbers -- a plain intrusive list takes 25.8 cycles per element; all numbers are for 10k elements in the collection.)
xemul commented Dec 10, 2025
upd:
Force-pushed from 8acacc4 to af69879

xemul commented Dec 10, 2025
upd:
xemul commented Dec 10, 2025

Documentation is yet to be written 😕
avikivity commented Dec 10, 2025

Looks like chunked_fifo wins. And again I need to re-measure https://github.com/avikivity/seastar/commits/sort-tasks/, I found some bugs in it.
xemul commented Dec 10, 2025

Looks like it doesn't :) I measured on an i7i.2xlarge instance with a Xeon processor -- split-list 2.1 vs chunked-fifo 2.4.
xemul commented Dec 10, 2025

I updated the PR description with the full table.
xemul commented Dec 10, 2025

Added (to the PR description) results for Threadripper, but with a different compiler (clang-20). Split-list 4.9 vs chunked-fifo 6.9.
The split list is a collection of objects that's optimized for scanning from begin to end, with the ability to add entries at two ends -- front and back, like std::deque does.

Splitting is needed to let the CPU pre-fetch the next element while scanning the current one. With classical linked lists this is difficult, because in order to prefetch the next element the CPU needs to read the next pointer from the current element, which is itself still being read. Thus the forward scan of a plain list is serialized. With split lists, several adjacent elements can be prefetched in parallel.

To achieve that, the "first" pointer is "sharded" -- there are F first pointers (as well as F cached last pointers) and the list is populated in round-robin manner -- first the 0th list is appended, then the 1st, then the 2nd, ... then the (F-1)th, and population wraps around and starts over.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
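For readers skimming the thread, here is a minimal sketch of the structure the commit message describes -- F sharded first pointers, F cached last pointers, round-robin population. It is my reconstruction from the quoted fragments, not the actual PR code, and it assumes a power-of-two F (see the wrap-around discussion further down).

```cpp
#include <array>
#include <cstddef>

template <typename T, unsigned F, T* T::*Next>
class split_list_sketch {
    static_assert((F & (F - 1)) == 0, "F must be a power of two");
    std::array<T*, F> _first{};  // F "sharded" first pointers, packed tightly
    std::array<T**, F> _last_p;  // F cached last pointers (addresses of next slots)
    unsigned _tail_idx = 0;
    unsigned _head_idx = 0;
public:
    split_list_sketch() noexcept {
        for (unsigned i = 0; i < F; i++) {
            _last_p[i] = &_first[i]; // empty sub-list: last pointer aims at its head slot
        }
    }
    void push_back(T& e) noexcept {
        auto i = _tail_idx++ & (F - 1); // round-robin over the F sub-lists
        e.*Next = nullptr;
        *_last_p[i] = &e;
        _last_p[i] = &(e.*Next);
    }
    T* pop_front() noexcept {
        auto i = _head_idx++ & (F - 1);
        T* e = _first[i];
        _first[i] = e->*Next;
        if (_first[i] == nullptr) {
            _last_p[i] = &_first[i]; // sub-list drained, re-arm its last pointer
        }
        return e;
    }
    std::size_t size() const noexcept { return _tail_idx - _head_idx; }
};

// Usage sketch: an intrusive element carries its own next pointer.
struct item { item* next; long value; };
using item_list = split_list_sketch<item, 16, &item::next>;
```

Because consecutive pushes land in consecutive sub-lists, a forward scan can chase F independent next pointers, which is what lets the CPU prefetch several elements ahead.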
Add the `--parameter name=value` vector option. Tests may then do `perf_tests::get_parameter("foo")` to get the user-configured parameter and update themselves accordingly.

It will be useful for future scanning tests, as they show very different results on collections of different sizes, and hard-coding a test for each interesting collection size is too much code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

This test allocates a vector of elements containing an integer, and measures the time it takes to scan the collection sequentially. Elements are artificially enlarged to occupy 64 bytes, so that this collection doesn't benefit from caching several values on read.

This is the basic reference test for further testing (next patches).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extend the vector<element> scanning test with a vector<element*> scanning one. The pointers are randomized.

This test works almost as fast as the base reference test does, because at most one element fits a cache line, so CPU caching is effectively off here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extend the test with a boost::intrusive_list<element> scan. This is to show that such scanning, despite occupying less memory, works _slower_ than the base reference test. This is because the CPU cannot prefetch adjacent elements in parallel -- the next pointer needs to be read anyway.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use the split-list with a factor of 16, as it shows the nicest results on my CPU (Ryzen Threadripper 2970WX).

This test beats the singly-linked list greatly, and works ~2x slower on 10k elements than the array of pointers does. But unlike the array of pointers, it doesn't require memory allocation on append operations.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Here the fifo stores pointers to elements, to behave like the array of pointers does, yet avoiding large (re)allocations on append.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
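As a usage illustration of this commit's approach (an assumed shape -- the actual test code is in the PR, and I'm assuming chunked_fifo's forward iterators here): a chunked_fifo of pointers scans like the array of pointers does, while only ever allocating fixed-size chunks.

```cpp
#include <seastar/core/chunked_fifo.hh>

struct element {
    long value;
    char padding[56]; // enlarge to 64 bytes, as the reference test does
};

// Hypothetical scan mirroring the benchmark's access pattern: walk the
// fifo front-to-back, touching each pointed-to element once.
long scan_sum(seastar::chunked_fifo<element*, 128>& q) {
    long sum = 0;
    for (auto it = q.begin(); it != q.end(); ++it) {
        sum += (*it)->value;
    }
    return sum;
}
```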
Force-pushed from af69879 to ed08c21

xemul commented Dec 10, 2025
upd:
```cpp
}
size_t size() const noexcept {
    return _tail_idx - _head_idx;
```
I just realized that in your implementation _tail_idx and _head_idx can keep growing as the queue is written and read over a long period of time, and wrap around the 32-bit unsigned. This is fine, almost, except for one thing: the "%" operation doesn't wrap around as expected when F is not a power of two. For example, consider F=5.
If _tail_idx reaches the last unsigned integer, _tail_idx=4294967295, and _tail_idx%5 = 0. Then we wrap around to _tail_idx=0, and _tail_idx%5 = 0 again!
So I think you need the static_assert that F is a power of two - not just for performance but also for correctness.
Wow, what a catch! Thanks!

I assume circular_buffer rounds up its size to a power of two for the same reason.
By the way, with push_front setting head_idx to -1 immediately, it might be easy to write a test that reproduces this bug when F is not a power of two.
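Along those lines, a standalone demonstration of the arithmetic (my example, not PR code): with F=5 the index-to-sub-list mapping repeats across the 32-bit wrap, while a power-of-two F advances consistently.

```cpp
#include <cstdio>

int main() {
    unsigned idx = 4294967295u; // UINT_MAX: what _head_idx becomes right after
                                // a push_front() on a fresh list, per the comment above
    std::printf("%u %u\n", idx % 5, idx % 4); // prints "0 3"
    idx++;                                    // wraps around to 0
    std::printf("%u %u\n", idx % 5, idx % 4); // prints "0 0": F=5 lands on sub-list 0
                                              // twice in a row; F=4 advances 3 -> 0 correctly
}
```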
```cpp
std::array<T*, F> _first;
std::array<T**, F> _last_p;
unsigned _tail_idx;
unsigned _head_idx;
```
This "unsigned" limits the the number of elements in the list to 2^32 - and pushing more than this will, if I understand correctly, fail silently. I don't think this limit will bother us, but perhaps it should be documented at least.
```cpp
public:
    iterator(unsigned idx) noexcept
        : _cur({nullptr})
```
By the way, I think the {nullptr} here is confusing about what it really does. I think (but correct me if I'm wrong) that it only initializes the first entry to nullptr explicitly, and then value-initializes the other ones to nullptr as well. So I think just {} would do exactly the same thing.
Yes, it initializes the zeroth with nullptr and the rest with defaults (nullptrs as well).

It's an old C way of zeroing out an array.

I'll change it to {}
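A tiny illustration of the point (standalone, not PR code): for std::array, aggregate-initializing with {nullptr} sets the first element explicitly and value-initializes the rest, so {} produces the same all-nullptr result.

```cpp
#include <array>
#include <cassert>

int main() {
    std::array<int*, 4> a{nullptr}; // element 0 explicit, the rest value-initialized
    std::array<int*, 4> b{};        // all elements value-initialized
    for (unsigned i = 0; i < 4; i++) {
        assert(a[i] == nullptr && b[i] == nullptr); // identical contents
    }
}
```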
mitso23 left a comment (edited)
Looking at the performance metrics, it appears that split_list and chunked_fifo show very similar performance - with chunked_fifo actually being slightly faster in some configurations (e.g., AMD Ryzen + GCC-15).
Would it be worth considering chunked_fifo as a first iteration? It's already well-established in Seastar and might solve the performance problem you're targeting without introducing a new data structure.
Additionally, it would be helpful to include circular_buffer in the benchmarks, since that's the current implementation for the scheduler run-queue. This would provide a clearer picture of the actual improvement.
I assume the main concern with circular_buffer is the potential reallocation when it reaches max capacity - is that correct?
xemul commentedDec 11, 2025
It's better to discuss it in#3125 , because that's the place where the new collection is actually used. This PR just adds some benchmarks. However ...
Not only. It's other main concern is that it allocates at all, thus making task wakeup potentially failing operation. However, I agree, that the problem is amortized, the buffer is never shrank back, so once it passes a wake-up storm, it survives until next, larger tsunami.
Agree. Actually, the purpose of this test was to show the difference in handling different memory layouts, and circular_buffer is layed out as vector. But I think that testing differentimplementations of the same layout is worthwhile. |
avikivity commented Dec 11, 2025

My feeling is that this thing is more sensitive to different workloads than chunked_fifo. A run with very small+fast tasks will prefer a large F, so that prefetch can happen in time. I believe the second effect is much less important, so we'd choose a reasonably large F (say, 8 -- what did you measure with?).

A different objection is that I'd like to pursue my sort-tasks thing, which un-FIFOs the task queue. But it needs more work.
avikivity commented Dec 11, 2025

The tasks themselves are allocated; if we can allocate a task, we can allocate a chunk for chunked_fifo.
xemul commented Dec 11, 2025

16
```cpp
class intrusive_split_list {
    std::array<T*, F> _first;
    std::array<T**, F> _last_p;
    unsigned _tail_idx;
```
Why not use a 64-bit unsigned int? Then you don't really need to worry about the tail wrapping.
The head wraps anyway (immediately after a push_front() that happens before any pop_front()). And 32 bits is more than enough for typical use-cases: a seastar shard may have at most 64GB, so a 32-bit index is enough for 16-byte objects (2^32 × 16B = 64GB). I don't think it's a problem.
```cpp
template <typename T, unsigned F, T* T::*Member>
class intrusive_split_list {
    std::array<T*, F> _first;
    std::array<T**, F> _last_p;
```
Wouldn't it be more readable to use dummy nodes as the heads of the linked lists? You wouldn't need to use T** then.
T can be an abstract type, like class task.
I see
Also, it's crucial to have the first pointers packed as tightly as possible to be cached on read; with dummy nodes it's not going to work as nicely.
```cpp
    }
};

PERF_TEST_F(sum_perf, sum_plain) {
```
Out of curiosity, why are you not using Google Benchmark? What are the benefits of using the seastar benchmark?
Just because there's a seastar perf test suite in the repo :)
```cpp
// checks that forward scanning reveals consistent results
BOOST_AUTO_TEST_CASE(test_split_list_grow_and_shrink) {
    const auto seed = std::random_device()();
    auto rnd_engine = std::mt19937(seed);
```
You should also test the iterator logic, and maybe the wrap-around logic if you want to use 32-bit unsigned integers for head/tail.
Iterator logic is checked with

```cpp
} else if (op == 2) { // scan
    unsigned scanned = 0;
    fmt::print("= {} ... {}\n", first, last);
    for (auto it = sl.begin(); it != sl.end(); ++it) {
        fmt::print("  {}\n", it->value);
        BOOST_REQUIRE_EQUAL(it->value, first + 1 + scanned);
        scanned++;
    }
    BOOST_REQUIRE_EQUAL(scanned, last - first);
```
The purpose of this PR is to demonstrate how different "queues" (see below) perform in a full-scan scenario.

Here a "queue" is a collection of objects that is scanned from begin to end and takes new entries at both ends -- front and back (see the commit messages below).

Initially such a "queue" was encountered in the CPU scheduler run-queue. In it, the stored element is task* and the actual elements are classes that inherit from class task. The run-queue is implemented as a circular_buffer<task*> container.

Scan performance comparison of different collections on 10k elements:
On an AMD Ryzen Threadripper 2970WX processor

GCC-15.2.1 (fastest to slowest): vector<T>, vector<T*>, chunked_fifo<T*, 128>, split_list<T, 16>, intrusive_list<T>

Clang-20.1.8 (fastest to slowest): vector<T>, vector<T*>, split_list<T, 16>, chunked_fifo<T*, 128>, intrusive_list<T>

On an Intel i7-10750H processor

GCC-13.3.0 (fastest to slowest): vector<T>, vector<T*>, split_list<T, 16>, chunked_fifo<T*, 128>, intrusive_list<T>

On an Intel(R) Xeon(R) Platinum 8559C processor (i7i.2xlarge instance)

Clang-20.1.2 (fastest to slowest): vector<T>, vector<T*>, split_list<T, 16>, chunked_fifo<T*, 128>, intrusive_list<T>