Closed
Description
NumPy has some wrappers for data allocation that call into the tracemalloc C API. For example, here's the wrapper around malloc:
Recently a Stack Overflow question led me to report a NumPy issue about poor multithreaded scaling. I think the bulk of the scaling bottleneck is due to the global mutex in the tracemalloc implementation, as you can see in the flame graph and profile in the linked NumPy issue.
From the NumPy issue:
On my M3 MacBook Pro, I get the following stdout running the script:
Inner loops 10, multithreading time: 6.68 sec, result sum: 717434683.1879175
Inner loops 10, multiprocessing time: 4.86 sec, result sum: 717434683.1879175
@Yhg1s told me on Discord that he has a patch that adds a fast path to tracemalloc based on an atomic flag and that seems to help a lot.
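To make the fast-path idea concrete, here is a minimal Python sketch of the general pattern: check a cheap "is tracing enabled" flag before touching the global lock. This is not the patch mentioned above and not tracemalloc's real code. The `Tracker` class and its methods are hypothetical names invented for illustration, and a plain attribute read stands in for what would be an atomic load in C.

```python
import threading


class Tracker:
    """Hypothetical allocation tracker; illustrates the pattern only."""

    def __init__(self):
        self._lock = threading.Lock()   # stand-in for the global mutex
        self._enabled = False           # stand-in for an atomic flag in C
        self._traces = {}               # pointer -> size
        self.lock_acquisitions = 0      # counts how often track calls take the lock

    def start(self):
        with self._lock:
            self._enabled = True

    def stop(self):
        with self._lock:
            self._enabled = False
            self._traces.clear()

    def track_slow(self, ptr, size):
        # Always takes the lock, even when tracing is disabled.
        with self._lock:
            self.lock_acquisitions += 1
            if self._enabled:
                self._traces[ptr] = size

    def track_fast(self, ptr, size):
        # Fast path: skip the lock entirely when tracing is disabled.
        if not self._enabled:
            return
        # Tracing may have been stopped in between, so the slow path
        # re-checks the flag while holding the lock.
        self.track_slow(ptr, size)
```

With tracing off, `track_fast` returns without any lock traffic, so threads that allocate heavily no longer contend on the mutex. The re-check under the lock keeps the behaviour correct if tracing is toggled concurrently.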