Rejected based on Raymond Hettinger’s sage advice[4]:
After looking at the source, I think this has almost zero chancefor replacing list(). There is too much value in a simple C API,low space overhead for small lists, good performance is common usecases, and having performance that is easily understood. TheBList implementation lacks these virtues and it trades-off a littleperformance in common cases in for much better performance inuncommon cases. As a Py3.0 PEP, I think it can be rejected.Depending on its success as a third-party module, it still has achance for inclusion in the collections module. The essentialcriteria for that is whether it is a superior choice for somereal-world use cases. I’ve scanned my own code and found no instanceswhere BList would have been preferable to a regular list. However,that scan has a selection bias because it doesn’t reflect what I wouldhave written had BList been available. So, after a few months, Iintend to poll comp.lang.python for BList success stories. If theyexist, then I have no problem with inclusion in the collectionsmodule. After all, its learning curve is near zero – the only costis the clutter factor stemming from indecision about the mostappropriate data structure for a given task.
The common case for list operations is on small lists. The currentarray-based list implementation excels at small lists due to thestrong locality of reference and infrequency of memory allocationoperations. However, an array takes O(n) time to insert and deleteelements, which can become problematic as the list gets large.
This PEP introduces a new data type, the BList, that has array-likeand tree-like aspects. It enjoys the same good performance on smalllists as the existing array-based implementation, but offers superiorasymptotic performance for most operations. This PEP proposesreplacing the makes two mutually exclusive proposals for including theBList type in Python:
The BList grew out of the frustration of needing to rewrite intuitivealgorithms that worked fine for small inputs but took O(n**2) time forlarge inputs due to the underlying O(n) behavior of array-based lists.The deque type, introduced in Python 2.4, solved the most commonproblem of needing a fast FIFO queue. However, the deque type doesn’thelp if we need to repeatedly insert or delete elements from themiddle of a long list.
A wide variety of data structure provide good asymptotic performancefor insertions and deletions, but they either have O(n) performancefor other operations (e.g., linked lists) or have inferior performancefor small lists (e.g., binary trees and skip lists).
The BList type proposed in this PEP is based on the principles ofB+Trees, which have array-like and tree-like aspects. The BListoffers array-like performance on small lists, while offering O(log n)asymptotic performance for all insert and delete operations.Additionally, the BList implements copy-on-write under-the-hood, soeven operations like getslice take O(log n) time. The table belowcompares the asymptotic performance of the current array-based listimplementation with the asymptotic performance of the BList.
| Operation | Array-based list | BList |
|---|---|---|
| Copy | O(n) | O(1) |
| Append | O(1) | O(log n) |
| Insert | O(n) | O(log n) |
| Get Item | O(1) | O(log n) |
| Set Item | O(1) | O(log n) |
| Del Item | O(n) | O(log n) |
| Iteration | O(n) | O(n) |
| Get Slice | O(k) | O(log n) |
| Del Slice | O(n) | O(log n) |
| Set Slice | O(n+k) | O(log k + log n) |
| Extend | O(k) | O(log k + log n) |
| Sort | O(n log n) | O(n log n) |
| Multiply | O(nk) | O(log k) |
An extensive empirical comparison of Python’s array-based list and theBList are available at[2].
The BList offers superior performance for many, but not all,operations. Choosing the correct data type for a particular use casedepends on which operations are used. Choosing the correct data typeas a built-in depends on balancing the importance of different usecases and the magnitude of the performance differences.
For the common uses cases of small lists, the array-based list and theBList have similar performance characteristics.
For the slightly less common case of large lists, there are two commonuses cases where the existing array-based list outperforms theexisting BList reference implementation. These are:
In performance tests on a 10,000 element list, BLists exhibited a 50%and 5% increase in execution time for these two uses cases,respectively.
The performance for the LIFO use case could be improved to O(n) time,by caching a pointer to the right-most leaf within the root node. Forlists that do not change size, the common case of sequential accesscould also be improved to O(n) time via caching in the root node.However, the performance of these approaches has not been empiricallytested.
Many operations exhibit a tremendous speed-up (O(n) to O(log n)) whenswitching from the array-based list to BLists. In performance testson a 10,000 element list, operations such as getslice, setslice, andFIFO-style insert and deletes on a BList take only 1% of the timeneeded on array-based lists.
In light of the large performance speed-ups for many operations, thesmall performance costs for some operations will be worthwhile formany (but not all) applications.
The BList is based on the B+Tree data structure. The BList is a wide,bushy tree where each node contains an array of up to 128 pointers toits children. If the node is a leaf, its children are theuser-visible objects that the user has placed in the list. If node isnot a leaf, its children are other BList nodes that are notuser-visible. If the list contains only a few elements, they will allbe a children of single node that is both the root and a leaf. Sincea node is little more than array of pointers, small lists operate ineffectively the same way as an array-based data type and share thesame good performance characteristics.
The BList maintains a few invariants to ensure good (O(log n))asymptotic performance regardless of the sequence of insert and deleteoperations. The principle invariants are as follows:
If an insert would cause a node to exceed 128 children, the nodespawns a sibling and transfers half of its children to the sibling.The sibling is inserted into the node’s parent. If the node is theroot node (and thus has no parent), a new parent is created and thedepth of the tree increases by one.
If a deletion would cause a node to have fewer than 64 children, thenode moves elements from one of its siblings if possible. If both ofits siblings also only have 64 children, then two of the nodes mergeand the empty one is removed from its parent. If the root node isreduced to only one child, its single child becomes the new root(i.e., the depth of the tree is reduced by one).
In addition to tree-like asymptotic performance and array-likeperformance on small-lists, BLists support transparentcopy-on-write. If a non-root node needs to be copied (as part ofa getslice, copy, setslice, etc.), the node is shared between multipleparents instead of being copied. If it needs to be modified later, itwill be copied at that time. This is completely behind-the-scenes;from the user’s point of view, the BList works just like a regularPython list.
In the worst case, the leaf nodes of a BList have only 64 childreneach, rather than a full 128, meaning that memory usage is aroundtwice that of a best-case array implementation. Non-leaf nodes use upa negligible amount of additional memory, since there are at least 63times as many leaf nodes as non-leaf nodes.
The existing array-based list implementation must grow and shrink asitems are added and removed. To be efficient, it grows and shrinksonly when the list has grow or shrunk exponentially. In the worstcase, it, too, uses twice as much memory as the best case.
In summary, the BList’s memory footprint is not significantlydifferent from the existing array-based implementation.
If the BList is added to the collections module, backwardscompatibility is not an issue. This section focuses on the option ofreplacing the existing array-based list with the BList. For users ofthe Python interpreter, a BList has an identical interface to thecurrent list-implementation. For virtually all operations, thebehavior is identical, aside from execution speed.
For the C API, BList has a different interface than the existinglist-implementation. Due to its more complex structure, the BListdoes not lend itself well to poking and prodding by external sources.Thankfully, the existing list-implementation defines an API offunctions and macros for accessing data from list objects. GoogleCode Search suggests that the majority of third-party modules uses thewell-defined API rather than relying on the list’s structuredirectly. The table below summarizes the search queries and results:
| Search String | Number of Results |
|---|---|
| PyList_GetItem | 2,000 |
| PySequence_GetItem | 800 |
| PySequence_Fast_GET_ITEM | 100 |
| PyList_GET_ITEM | 400 |
| [^a-zA-Z_]ob_item | 100 |
This can be achieved in one of two ways:
The macros would need to evaluate their arguments more than once,which could be a problem if the arguments have side effects. AGoogle Code Search for “PyList_GET_ITEM([^)]+(” found only ahandful of cases where this occurs, so the impact appears to below.
The few extension modules that use list’s undocumented structuredirectly, instead of using the API, would break. The core codeitself uses the accessor macros fairly consistently and should beeasy to port.
Existing modules would continue to compile and work without change,but they would need to make a deliberate (but small) effort tomigrate to the BList.
The downside of this approach is that mixing modules that useBLists and array-based lists might lead to slow down if conversionsare frequently necessary.
A reference implementations of the BList is available for CPython at[1].
The source package also includes a pure Python implementation,originally developed as a prototype for the CPython version.Naturally, the pure Python version is rather slow and the asymptoticimprovements don’t win out until the list is quite large.
When compiled with Py_DEBUG, the C implementation checks theBList invariants when entering and exiting most functions.
An extensive set of test cases is also included in the source package.The test cases include the existing Python sequence and list testcases as a subset. When the interpreter is built with Py_DEBUG, thetest cases also check for reference leaks.
If the BList is added to the collections module, other Python variantscan support it in one of three ways:
This proposal has been discussed briefly on the Python-3000 mailinglist[3]. Although a number of people favored the proposal, therewere also some objections. Below summarizes the pros and cons asobserved by posters to the thread.
General comments:
Comments on adding BList to the collections module:
Comments on replacing the array-based list with the BList:
To assess the desirability and performance in actual applications,Raymond Hettinger suggested releasing the BList as an extension module(now available at[1]). If it proves useful, he felt it would be astrong candidate for inclusion in 2.6 as part of the collectionsmodule. If widely popular, then it could be considered for replacingthe array-based list, but not otherwise.
Guido van Rossum commented that he opposed the proliferation of datatypes, but favored replacing the array-based list if backwardscompatibility could be addressed and the BList’s performance wasuniformly better.
This document has been placed in the public domain.
Source:https://github.com/python/peps/blob/main/peps/pep-3128.rst
Last modified:2025-02-01 08:59:27 GMT