oven-sh/mimallocPublic

forked frommicrosoft/mimalloc

NotificationsYou must be signed in to change notification settings
Fork4
Star5

mimalloc is a compact general purpose allocator with excellent performance.

License

MIT license

5 stars 964 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 2,860 Commits
bin		bin
cmake		cmake
doc		doc
docker		docker
docs		docs
ide		ide
include		include
src		src
test		test
.gitattributes		.gitattributes
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
SECURITY.md		SECURITY.md
azure-pipelines.yml		azure-pipelines.yml
mimalloc.pc.in		mimalloc.pc.in
readme.md		readme.md

Repository files navigation

mimalloc

mimalloc (pronounced "me-malloc")is a general purpose allocator with excellentperformance characteristics.Initially developed by Daan Leijen for the runtime systems of theKoka andLean languages.

Latest release tag:v2.1.7 (2024-05-21).
Latest v1 tag:v1.8.7 (2024-05-21).

mimalloc is a drop-in replacement formalloc and can be used in other programswithout code changes, for example, on dynamically linked ELF-based systems (Linux, BSD, etc.) you can use it as:

> LD_PRELOAD=/usr/lib/libmimalloc.so  myprogram

It also includes a robust way to override the default allocator inWindows. Notable aspects of the design include:

small and consistent: the library is about 8k LOC using simple andconsistent data structures. This makes it very suitableto integrate and adapt in other projects. For runtime systems itprovides hooks for a monotonicheartbeat and deferred freeing (forbounded worst-case times with reference counting).Partly due to its simplicity, mimalloc has been ported to many systems (Windows, macOS,Linux, WASM, various BSD's, Haiku, MUSL, etc) and has excellent support for dynamic overriding.At the same time, it is an industrial strength allocator that runs (very) large scaledistributed services on thousands of machines with excellent worst case latencies.
free list sharding: instead of one big free list (per size class) we havemany smaller lists per "mimalloc page" which reduces fragmentation andincreases locality --things that are allocated close in time get allocated close in memory.(A mimalloc page contains blocks of one size class and is usually 64KiB on a 64-bit system).
free list multi-sharding: the big idea! Not only do we shard the free listper mimalloc page, but for each page we have multiple free lists. In particular, thereis one list for thread-localfree operations, and another one for concurrentfreeoperations. Free-ing from another thread can now be a single CAS without needingsophisticated coordination between threads. Since there will bethousands of separate free lists, contention is naturally distributed over the heap,and the chance of contending on a single location will be low -- this is quitesimilar to randomized algorithms like skip lists where addinga random oracle removes the need for a more complex algorithm.
eager page purging: when a "page" becomes empty (with increased chancedue to free list sharding) the memory is marked to the OS as unused (reset or decommitted)reducing (real) memory pressure and fragmentation, especially in long runningprograms.
secure:mimalloc can be built in secure mode, adding guard pages,randomized allocation, encrypted free lists, etc. to protect against variousheap vulnerabilities. The performance penalty is usually around 10% on averageover our benchmarks.
first-class heaps: efficiently create and use multiple heaps to allocate across different regions.A heap can be destroyed at once instead of deallocating each object separately.
bounded: it does not suffer fromblowup [1], has bounded worst-case allocationtimes (wcat) (upto OS primitives), bounded space overhead (~0.2% meta-data, with lowinternal fragmentation), and has no internal points of contention using only atomic operations.
fast: In our benchmarks (seebelow),mimalloc outperforms other leading allocators (jemalloc,tcmalloc,Hoard, etc),and often uses less memory. A nice property is that it does consistently well over a wide rangeof benchmarks. There is also good huge OS page support for larger server programs.

Thedocumentation gives a full overview of the API.You can read more on the design ofmimalloc in thetechnical report which also has detailed benchmark results.

Enjoy!

Branches

master: latest stable release (based ondev-slice).
dev: development branch for mimalloc v1. Use this branch for submitting PR's.
dev-slice: development branch for mimalloc v2. This branch is downstream ofdev (and is essentially equal todev except forsrc/segment.c)

Releases

Note: thev2.x version has a different algorithm for managing internal mimalloc pages (as slices) that tends to use reducememory usageand fragmentation compared to mimallocv1.x (especially for large workloads). Should otherwise have similar performance(seebelow); please report if you observe any significant performance regression.

2024-05-21,v1.8.7,v2.1.7: Fix build issues on less common platforms. Started upstreaming patchesfrom the CPythonintegration. Upstreamvcpkg patches.
2024-05-13,v1.8.6,v2.1.6: Fix build errors on various (older) platforms. Refactored aligned allocation.
2024-04-22,v1.8.4,v2.1.4: Fixes various bugs and build issues. AddMI_LIBC_MUSL cmake flag for musl builds.Free-ing code is refactored into a separate module (free.c). Mimalloc page info is simplified with the block sizedirectly available (and newblock_size_shift to improve aligned block free-ing).New approach to collection of abandoned segments: Whena thread terminates the segments it owns are abandoned (containing still live objects) and these can bereclaimed by other threads. We no longer use a list of abandoned segments but this is now done using bitmaps in arena'swhich is more concurrent (and more aggressive). Abandoned memory can now also be reclaimed if a thread frees an object inan abandoned page (which can be disabled usingmi_option_abandoned_reclaim_on_free). The optionmi_option_max_segment_reclaimgives a maximum percentage of abandoned segments that can be reclaimed per try (=10%).
2023-04-24,v1.8.2,v2.1.2: Fixes build issues on freeBSD, musl, and C17 (UE 5.1.1). Reduce code size/complexityby removing regions and segment-cache's and only use arenas with improved memory purging -- this may improve memoryusage as well for larger services. Renamed options for consistency. Improved Valgrind and ASAN checking.
2023-04-03,v1.8.1,v2.1.1: Fixes build issues on some platforms.
2023-03-29,v1.8.0,v2.1.0: Improved support dynamic overriding on Windows 11. Improved tracing precisionwithasan andValgrind, and added Windows event tracingETW (contributed by Xinglong He). Created an OSabstraction layer to make it easier to port and separate platform dependent code (insrc/prim). Fixed C++ STL compilation on older Microsoft C++ compilers, and various small bug fixes.
2022-12-23,v1.7.9,v2.0.9: Supports building withasan and improvedValgrind support.Support arbitrary large alignments (in particular forstd::pmr pools).Added C++ STL allocators attached to a specific heap (thanks @vmarkovtsev).Heap walks now visit all object (including huge objects). Support Windows nano server containers (by Johannes Schindelin,@dscho).Various small bug fixes.
2022-11-03,v1.7.7,v2.0.7: Initial support forValgrind for leak testing and heap block overflowdetection. Initialsupport for attaching heaps to a speficic memory area (only in v2). Fixrealloc behavior for zero size blocks, remove restriction to integral multiple of the alignment inalloc_align, improved aligned allocation performance, reduced contention with many threads on few processors (thank you @dposluns!), vs2022 support, supportpkg-config, .
2022-04-14,v1.7.6,v2.0.6: fix fallback path for aligned OS allocation on Windows, improve Windows aligned allocationeven when compiling with older SDK's, fix dynamic overriding on macOS Monterey, fix MSVC C++ dynamic overriding, fixwarnings under Clang 14, improve performance if many OS threads are created and destroyed, fix statistics for large objectallocations, using MIMALLOC_VERBOSE=1 has no maximum on the number of error messages, various small fixes.
2022-02-14,v1.7.5,v2.0.5 (alpha): fix malloc override onWindows 11, fix compilation with musl, potentially reducedcommitted memory, addbin/minject for Windows,improved wasm support, faster aligned allocation,various small fixes.
Older release notes

Special thanks to:

David Carlier (@devnexen) for his many contributions, and makingmimalloc work better on many less common operating systems, like Haiku, Dragonfly, etc.
Mary Feofanova (@mary3000), Evgeniy Moiseenko, and Manuel Pöter (@mpoeter) for making mimalloc TSAN checkable, and findingmemory model bugs using thegenMC model checker.
Weipeng Liu (@pongba), Zhuowei Li, Junhua Wang, and Jakub Szymanski, for their early support of mimalloc and deploymentat large scale services, leading to many improvements in the mimalloc algorithms for large workloads.
Jason Gibson (@jasongibson) for exhaustive testing on large scale workloads and server environments, and finding complex bugsin (early versions of)mimalloc.
Manuel Pöter (@mpoeter) and Sam Gross(@colesbury) for finding an ABA concurrency issue in abandoned segment reclamation. Sam also created theno GIL Python fork whichuses mimalloc internally.

Usage

mimalloc is used in various large scale low-latency services and programs, for example:

Building

Windows

Openide/vs2022/mimalloc.sln in Visual Studio 2022 and build.Themimalloc project builds a static library (inout/msvc-x64), while themimalloc-override project builds a DLL for overriding mallocin the entire program.

macOS, Linux, BSD, etc.

We usecmake¹ as the build system:

> mkdir -p out/release> cd out/release> cmake ../..> make

This builds the library as a shared (dynamic)library (.so or.dylib), a static library (.a), andas a single object file (.o).

> sudo make install (install the library and header files in/usr/local/lib and/usr/local/include)

You can build the debug version which does many internal checks andmaintains detailed statistics as:

> mkdir -p out/debug> cd out/debug> cmake -DCMAKE_BUILD_TYPE=Debug ../..> make

This will name the shared library aslibmimalloc-debug.so.

Finally, you can build asecure version that uses guard pages, encryptedfree lists, etc., as:

> mkdir -p out/secure> cd out/secure> cmake -DMI_SECURE=ON ../..> make

This will name the shared library aslibmimalloc-secure.so.Useccmake² instead ofcmaketo see and customize all the available build options.

Notes:

Install CMake:sudo apt-get install cmake
Install CCMake:sudo apt-get install cmake-curses-gui

Single source

You can also directly build the singlesrc/static.c file as part of your project withoutneedingcmake at all. Make sure to also add the mimallocinclude directory to the include path.

Using the library

The preferred usage is including<mimalloc.h>, linking withthe shared- or static library, and using themi_malloc API exclusively for allocation. For example,

> gcc -o myprogram -lmimalloc myfile.c

mimalloc uses only safe OS calls (mmap andVirtualAlloc) and can co-existwith other allocators linked to the same program.If you usecmake, you can simply use:

find_package(mimalloc 1.4 REQUIRED)

in yourCMakeLists.txt to find a locally installed mimalloc. Then use either:

target_link_libraries(myapp PUBLIC mimalloc)

to link with the shared (dynamic) library, or:

target_link_libraries(myapp PUBLIC mimalloc-static)

to link with the static library. Seetest\CMakeLists.txt for an example.

For best performance in C++ programs, it is also recommended to override theglobalnew anddelete operators. For convenience, mimalloc providesmimalloc-new-delete.h which does this for you -- just include it in a single(!) source file in your project.In C++, mimalloc also provides themi_stl_allocator struct which implements thestd::allocatorinterface.

You can pass environment variables to print verbose messages (MIMALLOC_VERBOSE=1)and statistics (MIMALLOC_SHOW_STATS=1) (in the debug version):

> env MIMALLOC_SHOW_STATS=1 ./cfrac 175451865205073170563711388363175451865205073170563711388363 = 374456281610909315237213 * 468551heap stats:     peak      total      freed       unitnormal   2:    16.4 kb    17.5 mb    17.5 mb      16 b   oknormal   3:    16.3 kb    15.2 mb    15.2 mb      24 b   oknormal   4:      64 b      4.6 kb     4.6 kb      32 b   oknormal   5:      80 b    118.4 kb   118.4 kb      40 b   oknormal   6:      48 b       48 b       48 b       48 b   oknormal  17:     960 b      960 b      960 b      320 b   okheap stats:     peak      total      freed       unit    normal:    33.9 kb    32.8 mb    32.8 mb       1 b   ok      huge:       0 b        0 b        0 b        1 b   ok     total:    33.9 kb    32.8 mb    32.8 mb       1 b   okmalloc requested:         32.8 mb committed:    58.2 kb    58.2 kb    58.2 kb       1 b   ok  reserved:     2.0 mb     2.0 mb     2.0 mb       1 b   ok     reset:       0 b        0 b        0 b        1 b   ok  segments:       1          1          1-abandoned:       0     pages:       6          6          6-abandoned:       0     mmaps:       3 mmap fast:       0 mmap slow:       1   threads:       0   elapsed:     2.022s   process: user: 1.781s, system: 0.016s, faults: 756, reclaims: 0, rss: 2.7 mb

The above model of using themi_ prefixed API is not always possiblethough in existing programs that already use the standard malloc interface,and another option is to override the standard malloc interfacecompletely and redirect all calls to themimalloc library instead .

Environment Options

You can set further options either programmatically (usingmi_option_set), or via environment variables:

MIMALLOC_SHOW_STATS=1: show statistics when the program terminates.
MIMALLOC_VERBOSE=1: show verbose messages.
MIMALLOC_SHOW_ERRORS=1: show error and warning messages.

Advanced options:

MIMALLOC_ARENA_EAGER_COMMIT=2: turns on eager commit for the large arenas (usually 1GiB) from which mimallocallocates segments and pages. Set this to 2 (default) toonly enable this on overcommit systems (e.g. Linux). Set this to 1 to enable explicitly on other systemsas well (like Windows or macOS) which may improve performance (as the whole arena is committed at once).Note that eager commit only increases the commit but not the actual the peak resident set(rss) so it is generally ok to enable this.
MIMALLOC_PURGE_DELAY=N: the delay inN milli-seconds (by default10) after which mimalloc will purgeOS pages that are not in use. This signals to the OS that the underlying physical memory can be reused whichcan reduce memory fragmentation especially in long running (server) programs. SettingN to0 purges immediately whena page becomes unused which can improve memory usage but also decreases performance. SettingN to a highervalue like100 can improve performance (sometimes by a lot) at the cost of potentially using more memory at times.Setting it to-1 disables purging completely.
MIMALLOC_PURGE_DECOMMITS=1: By default "purging" memory means unused memory is decommitted (MEM_DECOMMIT on Windows,MADV_DONTNEED (which decresease rss immediately) onmmap systems). Set this to 0 to instead "reset" unusedmemory on a purge (MEM_RESET on Windows, generallyMADV_FREE (which does not decrease rss immediately) onmmap systems).Mimalloc generally does not "free" OS memory but only "purges" OS memory, in other words, it tries to keep virtualaddress ranges and decommits within those ranges (to make the underlying physical memory available to other processes).

Further options for large workloads and services:

MIMALLOC_USE_NUMA_NODES=N: pretend there are at mostN NUMA nodes. If not set, the actual NUMA nodes are detectedat runtime. SettingN to 1 may avoid problems in some virtual environments. Also, setting it to a lower number thanthe actual NUMA nodes is fine and will only cause threads to potentially allocate more memory across actual NUMAnodes (but this can happen in any case as NUMA local allocation is always a best effort but not guaranteed).
MIMALLOC_ALLOW_LARGE_OS_PAGES=1: use large OS pages (2 or 4MiB) when available; for some workloads this can significantlyimprove performance. When this option is disabled, it also disables transparent huge pages (THP) for the process(on Linux and Android). UseMIMALLOC_VERBOSE to check if the large OS pages are enabled -- usually one needsto explicitly give permissions for large OS pages (as onWindows andLinux). However, sometimesthe OS is very slow to reserve contiguous physical memory for large OS pages so use with care on systems thatcan have fragmented memory (for that reason, we generally recommend to useMIMALLOC_RESERVE_HUGE_OS_PAGES instead whenever possible).
MIMALLOC_RESERVE_HUGE_OS_PAGES=N: whereN is the number of 1GiBhuge OS pages. This reserves the huge pages atstartup and sometimes this can give a large (latency) performance improvement on big workloads.Usually it is better to not useMIMALLOC_ALLOW_LARGE_OS_PAGES=1 in combination with this setting. Just like largeOS pages, use with care as reservingcontiguous physical memory can take a long time when memory is fragmented (but reserving the huge pages is done atstartup only once).Note that we usually need to explicitly give permission for huge OS pages (as onWindows andLinux)).With huge OS pages, it may be beneficial to set the settingMIMALLOC_EAGER_COMMIT_DELAY=N (N is 1 by default) to delay the initialN segments (of 4MiB)of a thread to not allocate in the huge OS pages; this prevents threads that are short livedand allocate just a little to take up space in the huge OS page area (which cannot be purged as huge OS pages are pinnedto physical memory).The huge pages are usually allocated evenly among NUMA nodes.We can useMIMALLOC_RESERVE_HUGE_OS_PAGES_AT=N whereN is the numa node (starting at 0) to allocate allthe huge pages at a specific numa node instead.

Use caution when usingfork in combination with either large or huge OS pages: on a fork, the OS uses copy-on-writefor all pages in the original process including the huge OS pages. When any memory is now written in that area, theOS will copy the entire 1GiB huge page (or 2MiB large page) which can cause the memory usage to grow in large increments.

Secure Mode

mimalloc can be build in secure mode by using the-DMI_SECURE=ON flags incmake. This build enables various mitigationsto make mimalloc more robust against exploits. In particular:

All internal mimalloc pages are surrounded by guard pages and the heap metadata is behind a guard page as well (so a buffer overflowexploit cannot reach into the metadata).
All free list pointers areencodedwith per-page keys which is used both to prevent overwrites with a known pointer, as well as to detect heap corruption.
Double free's are detected (and ignored).
The free lists are initialized in a random order and allocation randomly chooses between extension and reuse within a page tomitigate against attacks that rely on a predicable allocation order. Similarly, the larger heap blocks allocated by mimallocfrom the OS are also address randomized.

As always, evaluate with care as part of an overall security strategy as all of the above are mitigations but not guarantees.

Debug Mode

Whenmimalloc is built using debug mode, various checks are done at runtime to catch development errors.

Statistics are maintained in detail for each object size. They can be shown usingMIMALLOC_SHOW_STATS=1 at runtime.
All objects have padding at the end to detect (byte precise) heap block overflows.
Double free's, and freeing invalid heap pointers are detected.
Corrupted free-lists and some forms of use-after-free are detected.

Overriding Standard Malloc

Overriding the standardmalloc (andnew) can be done eitherdynamically orstatically.

Dynamic override

This is the recommended way to override the standard malloc interface.

Dynamic Override on Linux, BSD

On these ELF-based systems we preload the mimalloc sharedlibrary so all calls to the standardmalloc interface areresolved to themimalloc library.

> env LD_PRELOAD=/usr/lib/libmimalloc.so myprogram

You can set extra environment variables to check that mimalloc is running,like:

> env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/libmimalloc.so myprogram

or run with the debug version to get detailed statistics:

> env MIMALLOC_SHOW_STATS=1 LD_PRELOAD=/usr/lib/libmimalloc-debug.so myprogram

Dynamic Override on MacOS

On macOS we can also preload the mimalloc sharedlibrary so all calls to the standardmalloc interface areresolved to themimalloc library.

> env DYLD_INSERT_LIBRARIES=/usr/lib/libmimalloc.dylib myprogram

Note that certain security restrictions may apply when doing this fromtheshell.

Dynamic Override on Windows

Dynamically overriding on mimalloc on Windowsis robust and has the particular advantage to be able to redirect all malloc/free calls that go throughthe (dynamic) C runtime allocator, including those from other DLL's or libraries.As it intercepts all allocation calls on a low level, it can be used reliablyon large programs that include other 3rd party components.There are four requirements to make the overriding work robustly:

Use the C-runtime library as a DLL (using the/MD or/MDd switch).
Link your program explicitly withmimalloc-override.dll library.To ensure themimalloc-override.dll is loaded at run-time it is easiest to insert somecall to the mimalloc API in themain function, likemi_version()(or use the/INCLUDE:mi_version switch on the linker). See themimalloc-override-test projectfor an example on how to use this.
Themimalloc-redirect.dll (ormimalloc-redirect32.dll) must be putin the same folder as the mainmimalloc-override.dll at runtime (as it is a dependency of that DLL).The redirection DLL ensures that all calls to the C runtime malloc API get redirected tomimalloc functions (which reside inmimalloc-override.dll).
Ensure themimalloc-override.dll comes as early as possible in the importlist of the final executable (so it can intercept all potential allocations).

For best performance on Windows with C++, itis also recommended to also override thenew/delete operations (by includingmimalloc-new-delete.ha single(!) source file in your project).

The environment variableMIMALLOC_DISABLE_REDIRECT=1 can be used to disable dynamicoverriding at run-time. UseMIMALLOC_VERBOSE=1 to check if mimalloc was successfully redirected.

We cannot always re-link an executable withmimalloc-override.dll, and similarly, we cannot alwaysensure the the DLL comes first in the import table of the final executable.In many cases though we can patch existing executables without any recompilationif they are linked with the dynamic C runtime (ucrtbase.dll) -- just put themimalloc-override.dllinto the import table (and putmimalloc-redirect.dll in the same folder)Such patching can be done for example withCFF Explorer ortheminject program.

Static override

On Unix-like systems, you can also statically link withmimalloc to override the standardmalloc interface. The recommended way is to link the final program with themimalloc single object file (mimalloc.o). We usean object file instead of a library file as linkers give preference tothat over archives to resolve symbols. To ensure that the standardmalloc interface resolves to themimalloc library, link it as the firstobject file. For example:

> gcc -o myprogram mimalloc.o  myfile1.c ...

Another way to override statically that works on all platforms, is tolink statically to mimalloc (as shown in the introduction) and include aheader file in each source file that re-definesmalloc etc. tomi_malloc.This is provided bymimalloc-override.h. This only works reliably though if all sources areunder your control or otherwise mixing of pointers from different heaps may occur!

Tools

Generally, we recommend using the standard allocator with memory tracking tools, but mimalloccan also be build to support theaddress sanitizer or the excellentValgrind tool.Moreover, it can be build to support Windows event tracing (ETW).This has a small performance overhead but does allow detecting memory leaks and byte-precisebuffer overflows directly on final executables. See also thetest/test-wrong.c file to test with various tools.

Valgrind

To build withvalgrind support, use theMI_TRACK_VALGRIND=ON cmake option:

> cmake ../.. -DMI_TRACK_VALGRIND=ON

This can also be combined with secure mode or debug mode.You can then run your programs directly under valgrind:

> valgrind <myprogram>

If you rely on overridingmalloc/free by mimalloc (instead of using themi_malloc/mi_free API directly),you also need to tellvalgrind to not intercept those calls itself, and use:

> MIMALLOC_SHOW_STATS=1 valgrind  --soname-synonyms=somalloc=*mimalloc* -- <myprogram>

By setting theMIMALLOC_SHOW_STATS environment variable you can check that mimalloc is indeedused and not the standard allocator. Even though theValgrind optionis called--soname-synonyms, this alsoworks when overriding with a static library or object file. Unfortunately, it is not possible todynamically override mimalloc usingLD_PRELOAD together withvalgrind.See also thetest/test-wrong.c file to test withvalgrind.

Valgrind support is in its initial development -- please report any issues.

ASAN

To build with the address sanitizer, use the-DMI_TRACK_ASAN=ON cmake option:

> cmake ../.. -DMI_TRACK_ASAN=ON

This can also be combined with secure mode or debug mode.You can then run your programs as:'

> ASAN_OPTIONS=verbosity=1 <myprogram>

When you link a program with an address sanitizer build of mimalloc, you shouldgenerally compile that program too with the address sanitizer enabled.For example, assuming you build mimalloc inout/debug:

clang -g -o test-wrong -Iinclude test/test-wrong.c out/debug/libmimalloc-asan-debug.a -lpthread -fsanitize=address -fsanitize-recover=address

Since the address sanitizer redirects the standard allocation functions, on some platforms (macOSX for example)it is required to compile mimalloc with-DMI_OVERRIDE=OFF.Adress sanitizer support is in its initial development -- please report any issues.

ETW

Event tracing for Windows (ETW) provides a high performance way to capture all allocations thoughmimalloc and analyze them later. To build with ETW support, use the-DMI_TRACK_ETW=ON cmake option.

You can then capture an allocation trace using the Windows performance recorder (WPR), using thesrc/prim/windows/etw-mimalloc.wprp profile. In an admin prompt, you can use:

> wpr -start src\prim\windows\etw-mimalloc.wprp -filemode> <my_mimalloc_program>> wpr -stop <my_mimalloc_program>.etl

and then open<my_mimalloc_program>.etl in the Windows Performance Analyzer (WPA), oruse a tool likeTraceControl that is specialized for analyzing mimalloc traces.

Performance

Last update: 2021-01-30

We testedmimalloc against many other top allocators over a widerange of benchmarks, ranging from various real world programs tosynthetic benchmarks that see how the allocator behaves under moreextreme circumstances. In our benchmark suite,mimalloc outperforms other leadingallocators (jemalloc,tcmalloc,Hoard, etc), and has a similar memory footprint. A nice property is that itdoes consistently well over the wide range of benchmarks.

General memory allocators are interesting as there exists no algorithm that isoptimal -- for a given allocator one can usually construct a workloadwhere it does not do so well. The goal is thus to find an allocationstrategy that performs well over a wide range of benchmarks withoutsuffering from (too much) underperformance in less common situations.

As always, interpret these results with care since some benchmarks test syntheticor uncommon situations that may never apply to your workloads. For example, mostallocators do not do well onxmalloc-testN but that includes even the bestindustrial allocators likejemalloc andtcmalloc that are used in some ofthe world's largest systems (like Chrome or FreeBSD).

Also, the benchmarks here do not measure the behaviour on very large and long-running server workloads,or worst-case latencies of allocation. Much work has gone intomimalloc to work well on suchworkloads (for example, to reduce virtual memory fragmentation on long-running services)but such optimizations are not always reflected in the current benchmark suite.

We show here only an overview -- formore specific details and further benchmarks we refer to thetechnical report.The benchmark suite is automated and available separatelyasmimalloc-bench.

Benchmark Results on a 16-core AMD 5950x (Zen3)

Testing on the 16-core AMD 5950x processor at 3.4Ghz (4.9Ghz boost), withwith 32GiB memory at 3600Mhz, runningUbuntu 20.04 with glibc 2.31 and GCC 9.3.0.

We measure three versions ofmimalloc: the main versionmi (tag:v1.7.0),the new v2.0 beta version asxmi (tag:v2.0.0), and the main version in secure mode assmi (tag:v1.7.0).

The other allocators areGoogle'stcmalloc (tc, tag:gperftools-2.8.1) used in Chrome,Facebook'sjemalloc (je, tag:5.2.1) by Jason Evans used in Firefox and FreeBSD,the Intel thread building blocksallocator (tbb, tag:v2020.3),rpmalloc (rp,tag:1.4.1) by Mattias Jansson,the original scalableHoard (git:d880f72) allocator by Emery Berger [1],the memory compactingMesh (git:67ff31a) allocator byBobby Powerset al [8],and finally the default system allocator (glibc, 2.31) (based onPtMalloc2).

Any benchmarks ending inN run on all 32 logical cores in parallel.Results are averaged over 10 runs and reported relativeto mimalloc (where 1.2 means it took 1.2× longer to run).The legend also contains theoverall relative score between theallocators where 100 points is the maximum if an allocator is fastest onall benchmarks.

The single threadedcfrac benchmark by Dave Barrett is an implementation ofcontinued fraction factorization which uses many small short-lived allocations.All allocators do well on such common usage, wheremimalloc is just a tadfaster thantcmalloc andjemalloc.

TheleanN program is interesting as a large realistic andconcurrent workload of theLeantheorem prover compiling its own standard library, and there is a 13%speedup overtcmalloc. This isquite significant: if Lean spends 20% of its time in theallocator that means thatmimalloc is 1.6× faster thantcmallochere. (This is surprising as that is not measured in a pureallocation benchmark likealloc-test. We conjecture that we see thisoutsized improvement here becausemimalloc has better locality inthe allocation which improves performance for theother computationsin a program as well).

The single threadedredis benchmark again show that most allocators do well on such workloads.

ThelarsonN server benchmark by Larson and Krishnan [2] allocates and frees between threads. They observed thisbehavior (which they callbleeding) in actual server applications, and the benchmark simulates this.Here,mimalloc is quite a bit faster thantcmalloc andjemalloc probably due to the object migration between different threads.

ThemstressN workload performs many allocations and re-allocations,and migrates objects between threads (as inlarsonN). However, it alsocreates and destroys theN worker threads a few times keeping some objectsalive beyond the life time of the allocating thread. We observed thisbehavior in many larger server applications.

TherptestN benchmarkby Mattias Jansson is a allocator test originally designedforrpmalloc, and tries to simulate realistic allocation patterns overmultiple threads. Here the differences between allocators become more apparent.

The second benchmark set tests specific aspects of the allocators andshows even more extreme differences between them.

Thealloc-test, byOLogN Technologies AG, is a very allocation intensive benchmark doing millions ofallocations in various size classes. The test is scaled such that when anallocator performs almost identically onalloc-test1 asalloc-testN itmeans that it scales linearly.

Thesh6bench andsh8bench benchmarks aredeveloped byMicroQuill as part of SmartHeap.Insh6benchmimalloc does muchbetter than the others (more than 2.5× faster thanjemalloc).We cannot explain this well but believe it iscaused in part by the "reverse" free-ing pattern insh6bench.Thesh8bench is a variation with object migrationbetween threads; whereastcmalloc did well onsh6bench, the addition of object migration causes it to be 10× slower than before.

Thexmalloc-testN benchmark by Lever and Boreham [5] and Christian Eder, simulates an asymmetric workload wheresome threads only allocate, and others only free -- they observed this pattern inlarger server applications. Here we see thatthemimalloc technique of having non-contended sharded thread freelists pays off as it outperforms others by a very large margin. Onlyrpmalloc,tbb, andglibc also scale well on this benchmark.

Thecache-scratch benchmark by Emery Berger [1], and introduced withthe Hoard allocator to test forpassive-false sharing of cache lines.With a single thread they allperform the same, but when running with multiple threads the potential allocatorinduced false sharing of the cache lines can cause large run-time differences.Crundal [6] describes in detail why the false cache line sharing occurs in thetcmalloc design, and also discusses how thiscan be avoided with some small implementation changes.Only thetbb,rpmalloc andmesh allocators also avoid thecache line sharing completely, whileHoard andglibc seem to mitigatethe effects. Kukanov and Voss [7] describe in detailhow the design oftbb avoids the false cache line sharing.

On a 36-core Intel Xeon

For completeness, here are the results on a big Amazonc5.18xlarge instanceconsisting of a 2×18-core Intel Xeon (Cascade Lake) at 3.4GHz (boost 3.5GHz)with 144GiB ECC memory, runningUbuntu 20.04 with glibc 2.31, GCC 9.3.0, andClang 10.0.0. This time, the mimalloc allocators (mi, xmi, and smi) werecompiled with the Clang compiler instead of GCC.The results are similar to the AMD results but it is interesting tosee the differences in thelarsonN,mstressN, andxmalloc-testN benchmarks.

Peak Working Set

The following figure shows the peak working set (rss) of the allocatorson the benchmarks (on the c5.18xlarge instance).

Note that thexmalloc-testN memory usage should be disregarded as itallocates more the faster the program runs. Similarly, memory usage oflarsonN,mstressN,rptestN andsh8bench can vary depending on scheduling andspeed. Nevertheless, we hope to improve the memory usage onmstressNandrptestN (just ascfrac,larsonN andsh8bench have a small working set which skews the results).

References

[1] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson.Hoard: A Scalable Memory Allocator for Multithreaded Applicationsthe Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX). Cambridge, MA, November 2000.pdf
[2] P. Larson and M. Krishnan.Memory allocation for long-running server applications.In ISMM, Vancouver, B.C., Canada, 1998.pdf
[3] D. Grunwald, B. Zorn, and R. Henderson.Improving the cache locality of memory allocation. In R. Cartwright, editor,Proceedings of the Conference on Programming Language Design and Implementation, pages 177–186, New York, NY, USA, June 1993.pdf
[4] J. Barnes and P. Hut.A hierarchical O(n*log(n)) force-calculation algorithm. Nature, 324:446-449, 1986.
[5] C. Lever, and D. Boreham.Malloc() Performance in a Multithreaded Linux Environment.In USENIX Annual Technical Conference, Freenix Session. San Diego, CA. Jun. 2000.Available athttps://github.com/kuszmaul/SuperMalloc/tree/master/tests
[6] Timothy Crundal.Reducing Active-False Sharing in TCMalloc. 2016. CS16S1 project at the Australian National University.pdf
[7] Alexey Kukanov, and Michael J Voss.The Foundations for Scalable Multi-Core Software in Intel Threading Building Blocks.Intel Technology Journal 11 (4). 2007
[8] Bobby Powers, David Tench, Emery D. Berger, and Andrew McGregor.Mesh: Compacting Memory Management for C/C++In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'19), June 2019, pages 333-–346.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to aContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant usthe rights to use your contribution. For details, visithttps://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to providea CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructionsprovided by the bot. You will only need to do this once across all repos using our CLA.

Older Release Notes

2021-11-14,v1.7.3,v2.0.3 (beta): improved WASM support, improved macOS support and performance (includingM1), improved performance for v2 for large objects, Python integration improvements, more standardinstallation directories, various small fixes.
2021-06-17,v1.7.2,v2.0.2 (beta): support M1, better installation layout on Linux, fixthread_id on Android, prefer 2-6TiB area for aligned allocation to work better on pre-windows 8, various small fixes.
2021-04-06,v1.7.1,v2.0.1 (beta): fix bug in arena allocation for huge pages, improved aslr on large allocations, initial M1 support (still experimental).
2021-01-31,v2.0.0: beta release 2.0: new slice algorithm for managing internal mimalloc pages.
2021-01-31,v1.7.0: stable release 1.7: support explicit user provided memory regions, more precise statistics,improve macOS overriding, initial support for Apple M1, improved DragonFly support, faster memcpy on Windows, various small fixes.
2020-09-24,v1.6.7: stable release 1.6: using standard C atomics, passing tsan testing, improvedhandling of failing to commit on Windows, addmi_process_info api call.
2020-08-06,v1.6.4: stable release 1.6: improved error recovery in low-memory situations,support for IllumOS and Haiku, NUMA support for Vista/XP, improved NUMA detection for AMD Ryzen, ubsan support.
2020-05-05,v1.6.3: stable release 1.6: improved behavior in out-of-memory situations, improved malloc zones on macOS,build PIC static libraries by default, add option to abort on out-of-memory, line buffered statistics.
2020-04-20,v1.6.2: stable release 1.6: fix compilation on Android, MingW, Raspberry, and Conda,stability fix for Windows 7, fix multiple mimalloc instances in one executable, fixstrnlen overload,fix aligned debug padding.
2020-02-17,v1.6.1: stable release 1.6: minor updates (build with clang-cl, fix alignment issue for small objects).
2020-02-09,v1.6.0: stable release 1.6: fixed potential memory leak, improved overridingand thread local support on FreeBSD, NetBSD, DragonFly, and macOSX. New byte-preciseheap block overflow detection in debug mode (besides the double-free detection and free-listcorruption detection). Addnodiscard attribute to most allocation functions.EnableMIMALLOC_PAGE_RESET by default. New reclamation strategy for abandoned heap pagesfor better memory footprint.
2020-02-09,v1.5.0: stable release 1.5: improved free performance, small bug fixes.
2020-01-22,v1.4.0: stable release 1.4: improved performance for delayed OS page reset,more eager concurrent free, addition of STL allocator, fixed potential memory leak.
2020-01-15,v1.3.0: stable release 1.3: bug fixes, improved randomness andstrongerfree list encoding in secure mode.
2019-12-22,v1.2.2: stable release 1.2: minor updates.
2019-11-22,v1.2.0: stable release 1.2: bug fixes, improved secure mode (free list corruption checks, double free mitigation). Improved dynamic overriding on Windows.
2019-10-07,v1.1.0: stable release 1.1.
2019-09-01,v1.0.8: pre-release 8: more robust windows dynamic overriding, initial huge page support.
2019-08-10,v1.0.6: pre-release 6: various performance improvements.

About

mimalloc is a compact general purpose allocator with excellent performance.

Releases

35tags

Packages

No packages published

Languages

C82.0%
C++13.8%
CMake3.5%
Other0.7%

Movatterモバイル変換

License

oven-sh/mimalloc

Folders and files

Latest commit

History

Repository files navigation

mimalloc

Branches

Releases

Usage

Building

Windows

macOS, Linux, BSD, etc.

Single source

Using the library

Environment Options

Secure Mode

Debug Mode

Overriding Standard Malloc

Dynamic override

Dynamic Override on Linux, BSD

Dynamic Override on MacOS

Dynamic Override on Windows

Static override

Tools

Valgrind

ASAN

ETW

Performance

Benchmark Results on a 16-core AMD 5950x (Zen3)

On a 36-core Intel Xeon

Peak Working Set

References

Contributing

Older Release Notes

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Languages

Packages