
Tracemalloc C API scales poorly in multithreaded use #143057

Closed


Labels

interpreter-core (Objects, Python, Grammar, and Parser dirs)
performance (Performance or resource usage)
type-bug (An unexpected behavior, bug, or error)

Description

ngoldbaum opened on Dec 22, 2025

NumPy has some wrappers for data allocation that call into the tracemalloc C API. For example, here's the wrapper around malloc:

https://github.com/numpy/numpy/blob/f6440be7b8eec4a6481832f15f6730d984d78ef0/numpy/_core/src/multiarray/alloc.c#L255-L271

Recently a Stack Overflow question led me to report a NumPy issue about poor multithreaded scaling. I think the bulk of the scaling bottleneck is due to the global mutex in the tracemalloc implementation, as you can see in the flame graph and profile in the linked NumPy issue.

From the NumPy issue:

On my M3 Macbook Pro, I get the following stdout running the script:
Inner loops 10, multithreading  time: 6.68 sec, result sum: 717434683.1879175
Inner loops 10, multiprocessing time: 4.86 sec, result sum: 717434683.1879175

@Yhg1s told me on Discord that he has a patch that adds a fast path to tracemalloc based on an atomic flag and that seems to help a lot.
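To make the idea concrete, here is a minimal sketch of what an atomic-flag fast path in front of a global mutex could look like. This is not the patch mentioned above and not CPython's code: every name here (`sketch_track`, `tracing_enabled`, `tables_lock`, and so on) is hypothetical, and it assumes the fast path targets the case where tracing is not active. The only detail borrowed from the real API is the return convention documented for `PyTraceMalloc_Track`, which returns -2 when tracemalloc is disabled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tracker state; these names are illustrative, not CPython's. */
static atomic_int tracing_enabled;   /* 0 = off, 1 = on; zero-initialized */
static pthread_mutex_t tables_lock = PTHREAD_MUTEX_INITIALIZER;
static size_t traced_bytes;              /* protected by tables_lock */
static unsigned long lock_acquisitions;  /* protected by tables_lock */

void sketch_start(void)
{
    atomic_store_explicit(&tracing_enabled, 1, memory_order_release);
}

void sketch_stop(void)
{
    atomic_store_explicit(&tracing_enabled, 0, memory_order_release);
}

/* Record an allocation. Returns 0 on success and -2 when tracing is off. */
int sketch_track(uintptr_t ptr, size_t size)
{
    (void)ptr;  /* a real tracker would key its trace table on ptr */

    /* Fast path: one atomic load, no mutex, when tracing is disabled. */
    if (!atomic_load_explicit(&tracing_enabled, memory_order_acquire)) {
        return -2;
    }

    /* Slow path: take the global lock only when there is work to do. */
    int rc = pthread_mutex_lock(&tables_lock);
    assert(rc == 0);
    (void)rc;

    /* Re-check under the lock: tracing may have been stopped meanwhile. */
    if (!atomic_load_explicit(&tracing_enabled, memory_order_relaxed)) {
        pthread_mutex_unlock(&tables_lock);
        return -2;
    }
    traced_bytes += size;
    lock_acquisitions++;
    pthread_mutex_unlock(&tables_lock);
    return 0;
}

/* How many times the slow path took the lock (for illustration only). */
unsigned long sketch_lock_acquisitions(void)
{
    pthread_mutex_lock(&tables_lock);
    unsigned long n = lock_acquisitions;
    pthread_mutex_unlock(&tables_lock);
    return n;
}
```

With this shape, threads that allocate while tracing is off never touch `tables_lock`, so they do not serialize on it. Contention on the mutex remains whenever tracing is actually enabled.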

