Movatterモバイル変換


[0]ホーム

URL:


Home

PLF C++ Library - plf::list

Introduction

plf::list is a drop-in higher-performance replacement for std::list, with(on average) *:

Like std::list, plf::list maintains stable pointer/iterator links to listelements regardless of erasure, insertion, merging, splicing, reversing andsorting. It's functionality replicates std::list with two minor exceptions: theremoval of single-element and ranged splices when the source is not *this (splicing whole lists from source into *this is still available)and inclusion of several useful functions: reserve(), shrink_to_fit(),trim(), memory(), and overloads for single-element and ranged splice without a source list argument (source must always be *this). Non-member function overloads for std::erase, std::erase_if and std::swap are also implemented.

It achieves it's speed via the following:

  1. Unrolling the list ie. storing list nodes in memory blocks rather than allocating each individually.
  2. Maintaining per-block free lists of erased nodes and searching in an expansion pattern for non-empty free lists upon reinsertion to a given location.
  3. My pointer-array sort hack.

The first increases insertion and erasure speed since individual per-elementallocations and deallocations are no longer necessary, and also increases thespeed of iteration due to increased element contiguousness in memory.
The second re-uses erased element memory locations, but in a way which promoteselement contiguousness (positioning the inserted element as close to theinsertion position as possible in memory). Several other methods were trialed,but this had the best balance of reinsertion speed and subsequent iterationspeed.
The third is a simple method which allows any linked list to usequicksort-style internal sorting rather than the slower mergesort-basedinternal sorting typically utilized, but without invalidatingpointers/iterators to elements and without the disadvantage of slower sortspeed for larger-than-scalar types typically evidenced with a quicksortmethod.

* These results are not achievable with std::list +an allocator for the reasons describedhere. Results arefrom benchmarks performed on a haswell-based CPU under GCC 8.1. Averagepercentages are obtained by testing with numbers of elements ranging from 10 to1000000, and for the use-case, insertion, erasure and iteration benchmarks,from five different type sizes from char to a very large struct.

Motivation

Linked lists have many useful characteristics: they splice together withoutlosing pointer validity to contained elements, they sort quickly for large types,and their erasure speed is very good. Unfortunately the base implementation ofmost linked lists disallows decent performance for insertion - specifically,most linked lists allocate per-node, which in the modern age is particularlyslow. The C++ std::list spec disallows clustered allocations by allowingpartial list splicing - ie. if a block of nodes is allocated instead of one,the block cannot be split without invalidating pointers to nodes within theblock. This inability to allocate multiple nodes at once has a negative effect on insertion, iteration and erasure due to cacheeffects and the increased number of allocation/deallocation calls.

plf::list is a re-think from the ground-up of how to structure a linked listaround modern hardware. This means grouped allocations, re-use of memory andsort modification. The result is not something which can compete with vectorfor iteration performance, but which easily beats it for non-back insertion and erasurespeeds, and increases it's competitiveness with aspects like sorting. It isdesigned as a drop-in replacement for std::list, as well as a demonstration ofhow lists can perform outside of the C++ specification's requirement of allowing partial-listsplicing.

plf::list is designed for better push_back() performance than push_front()performance, in keeping with the most typical container usage patterns.

Implementation

Per-memory-block free list usage:

As detailed earlier, extensive testing showed block sizes of 2048 elementsto be optimal across types for balancing the reinsertion mechanism - what thismeans is that, for any given block, the free list may contain element locationsin any area of that block, and, the location popped off the top of the list tobe used for reinsertion, may not be the location closest to the insertionposition (out of all locations in the free list). But if we decrease the blocksize, we increase the probability of any free list location being closer to theinsertion position. However making the block size too low results in lowerediteration speeds, as fewer elements are contiguous.

Other methods were tested for the reinsertion mechanism, with the dual aimof quick reinsertion and finding a reinsertion location as close to theinsertion position as possible. In no particular order they are:

  1. Reversed jump-counting skipfield notation. It is possible to use a high-complexity jump-counting skipfield to find the nearest non-skipped element from any given skipped element. The details of this are too complex to be discussed here but may be found in thefull (but innaccurate and out-of-date) version of the jump-counting skipfield paper under "As a measure of comparitive 1-dimensional distance". In this context we used the skipfield in reverse, notating erased element locations as unskipped and non-erased element locations as skipped. From this we could find the nearest erased element location to the insertion position (non-erased element) very quickly. But unfortunately, the time taken to update the jump-counting skipfield upon insertion and erasure was too long, and this was the slowest method.
  2. Global free list. This method has been used before in some other linked list implementations (including a javascript one). You use a global free list of all erased element locations across all memory blocks within the linked list, and simply pop a location from the top of the list when re-inserting. This method has several failings - the first is that upon removal of a memory block, one must process the entire free list to remove all erased locations from that block. For large blocks, this can be pretty slow. Secondly, there is no way of ensuring that the selected location is close to the insertion position, other than processing the entire free list, which again is slow. This means that inevitably you end up with large memory distances between two linked nodes in the linked list, which is bad for iterative performance.
  3. Per-memory-block free lists. This works by having separate free lists for each memory block. In the event that the free list for the memory block that the insertion position is in, is empty, the mechanism radiates outward, checking the free lists for the right and left memory blocks in sequence until a non-empty free list is found. While this method does not necessarily guarantee that the chosen location will be as close as possible to the insertion position, it is the fastest overall, and, given an appropriate block size, the effect of the memory distance from the insertion position is mitigated to an extent.

As mentioned, the latter is the chosen method - however there were severalsub-methods of this which did not get selected. The first was to process theselected free list to find the location within that free list closest to theinsertion position. This did not result in a net gain when benchmarked withsubsequent iterations balanced against the additional reinsertion cost. Thesecond sub-method was searching from the last memory block in the list to thefirst sequentially instead of radiating outwards from the insertion position'sblock - again this was slower, due to the likelyhood of finding a locationfurther away in memory than with the radiant pattern.

The reason why the radiant pattern works best is because in terms ofiteration, if you find a location in as close a memory block as possible, youincrease the overall statistical likelihood that the linked list sequence willalso follow the sequence of the memory blocks in memory (provided the blocksare allocated close to each other, this will affect cache locality).

The pointer-array sort hack:

Normally it is impossible for a container with a bidirectional iterator(such as a list) to utilize std::sort, as it requires a random-access iterator.This method allows any container to use std::sort (or any sort technique thatrequires random-access iterators). It has a memory cost of N pointers over anormal list sort - one pointer for each element in the container. Each elementhas it's node's memory address inserted into an array - the order of theinsertion is not important. Hence, while with a traditional linked list thelist must be iterated over to add the addresses to the array, with plf::listthe nodes of all memory blocks can be processed sequentially in order, speedingup the process of filling the array.

Once this array of N pointers is filled with the addresses of all activenodes in the list, we then use a templated dereferencing functor in order tosort these pointers based on the value of the elements within the nodes thatthey point to:

template <class comparison_function>
struct sort_dereferencer
{
comparison_function stored_instance;

sort_dereferencer(const comparison_function &function_instance):
stored_instance(function_instance)
{}

sort_dereferencer()
{}

bool operator() (const node_pointer_type first, const node_pointer_type second)const
{
return stored_instance(first->element, second->element);
}
};

This function object takes whatever sort function is supplied to the sortoperation, and uses it to sort the pointers instead of the elements. Since anarray is a linear structure and can hence utilize a random-access iterator, wecan now sort the array of pointers using std::sort (or anytechnique), the function object above and whichever sort function issupplied.

Once the pointers are sorted according to the sort function and the elementsthemselves, the final process is simple;

  1. Process array sequentially, for every pointer at index i, dereference to node and set previous and next pointers to the addresses of i - 1 and i + 1 respectively.
  2. Set the node pointed to by array index 0 as the front node, and the node pointed to by the last pointer in the array as the back node.

The resultant sort process is roughly 72% faster than std::list's sort, whenaveraged across all datatypes and numbers of elements. It is not, unlikemergesort, stable (in this context, stable means two contiguous objects withthe same sort value will be in the same order once sorted), because std::sortis not guaranteed to be stable, but std::stable_sort can be used instead ifthis is important.

Inner structure:

The internal structure of plf::list is as follows: a custom vector holds theindividual memory blocks that the elements are stored within, plus each memoryblock's metadata (this combined with the memory block is termed a 'group').Part of the metadata is each block's 'free list' of erased memory locations.When a memory block becomes empty of elements it is either freed to the OS ormoved to the back of the vector for subsequent re-use in future. Theimplementation defines the protocol for deciding whether or not to free theblock, but in my implementation the decision is based on whether or not thememory block is of maximum size, and also whether it is close to the back ofthe vector.

When an insertion occurs, if there are no erased locations the new elementis pushed to the next available element location in the last memory block andthen linked into insertion position. Otherwise the free list of the memoryblock which the insertion position is within is checked - if empty, we checkeach memory block to the left and right in sequence until a non-empty free listis found. The selected non-empty free list has it's top element popped off andre-used.

To avoid double-destruction of elements when clearing or destroying aplf::list, element nodes which are added to their memory block's free list havetheir 'next' pointer set to NULL - since free lists only utilize the previouspointer, and non-freelist node pointers cannot be == NULL, this has no effecton operation. When a plf::list is cleared or destroyed, instead of iteratingover the list normally, we can more quickly iterate directly over the memoryblocks themselves, destroying any element of any node where the next pointer is!= NULL. The reason this is faster is because in this method, each subsequentnode is contiguous in memory while iterating, compared to a regular listtraversal which can jump all over the place.

License

plf::list is under a permissivezlib license. This means:Software is used on 'as-is' basis. It can be modified and used in commercialsoftware. Authors are not liable for any damages arising from its use. Thedistribution of a modified version of the software is subject to the followingrestrictions:

Download

Downloadhere or viewtherepository

The list library is a simple .h header file, to be included with a #includecommand.

In addition if you are interested in benchmarking you can also download theplf benchmark suite.

Function Descriptions and syntax

plf::list meets the technical specifications and has the same functions anditerator functions asstd::list, with theexception of the removal of range or single-element splices between plf::list's. Only full list (ie. all element) splices between plf::list's areallowed. Range and single-element spliceswithin a plf::list (ie. source == *this)are allowed and are detailed below. In addition, plf::list does not guarantee that contiguous and equalelements will be in the same order after sorting as std::list does (ie. thesort style is not a 'stable'-type). This is due to the way that plf::list::sort() gathers element pointers in the 'gather' phase of the algorithm (see above).

To specify an alternative sort technique, #define PLF_SORT_FUNCTION as the name of the sort function you wish to use.The sort function in question must take the same parameters as std::sort. An example would be#define PLF_SORT_FUNCTION std::stable_sort. This macro must be defined before plf_list.h is #include'd.

Additional functions specific to plf::list follow:

Benchmarks

Benchmark results for plf::list v1.85 under GCC 9.2 x64 on an Intel XeonE3-1241 (Haswell) are availablehere.
Some results for an early version of plf::list under GCC 5.1 x64 on an IntelE8500 (Core2) arehere.

Frequently-asked Questions

Version history

Contact:footer
plf:: library and this page Copyright (c) 2024, Matthew Bentley


[8]ページ先頭

©2009-2025 Movatter.jp