feat: support mmap for model loading #1059
base: master
Conversation
Green-Sky commented Dec 7, 2025
How much value would there be in llama.cpp exporting the mmap stuff as a library?
wbruna commented Dec 9, 2025
I don't think it'd help that much right now. The mmap part itself is more-or-less straightforward; replacing the current alloc+memcpy code with a buffer managed externally will be much trickier.
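A minimal sketch of that "straightforward" part, assuming POSIX/Linux (an illustration only, not this PR's actual code; the names `mapped_file`, `map_model_file`, and `copy_tensor_data` are made up here):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct mapped_file {
    const std::uint8_t* data = nullptr;
    std::size_t         size = 0;
};

// Map the whole model file read-only. Error handling kept minimal.
static bool map_model_file(const char* path, mapped_file& out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }
    void* p = mmap(nullptr, (std::size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping holds its own reference to the file
    if (p == MAP_FAILED) return false;
    out.data = (const std::uint8_t*)p;
    out.size = (std::size_t)st.st_size;
    return true;
}

// What the mmap+memcpy load path boils down to: a read() at file offset
// `off` into a caller-owned buffer becomes a plain memcpy from the mapping.
static void copy_tensor_data(const mapped_file& mf, std::size_t off,
                             void* dst, std::size_t n) {
    std::memcpy(dst, mf.data + off, n);
}
```

The tricky part wbruna points at is the second step: getting rid of that memcpy entirely means the destination buffer is no longer owned by the loader, which touches buffer lifetime and backend allocation code.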
valkarias commented Dec 10, 2025 • edited
Have you experimented with mmap'ing and then copying to the GPU?
wbruna commented Dec 10, 2025
Not yet. Right now I'm just reusing the I/O buffer; adding a separate code path to deliver the mapped area directly to the backend, just to avoid a memcpy, sounded like too much change for too little potential gain.
The behavior you describe sounds odd, though. At least on Linux, large dynamically-allocated memory areas are backed by mmap anyway, so they should behave the same. Maybe it's a difference between file-backed and anonymous mappings.
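To illustrate the distinction being drawn here, a tiny sketch of the two mapping kinds (assuming Linux/glibc behavior, where malloc services large allocations via anonymous mmap internally):

```cpp
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s <model-file>\n", argv[0]); return 1; }

    // Anonymous mapping: zero-filled private pages with no file behind them.
    // This is what a large malloc() typically turns into on Linux.
    const std::size_t big = (std::size_t)1 << 30;
    void* anon = mmap(nullptr, big, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // File-backed mapping: pages are populated from the page cache of the
    // file, which is what this PR's loading path uses.
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return 1; }
    void* filemap = mmap(nullptr, (std::size_t)st.st_size, PROT_READ,
                         MAP_PRIVATE, fd, 0);

    if (anon != MAP_FAILED) munmap(anon, big);
    if (filemap != MAP_FAILED) munmap(filemap, (std::size_t)st.st_size);
    close(fd);
    return 0;
}
```

Both kinds are mmap regions as far as the kernel is concerned, but they differ in where page faults are served from (zero pages vs. the page cache), which could plausibly explain behavior differences when copying to the GPU.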
wbruna commented Dec 14, 2025
Using the memory-mapped area directly for the tensor weights seems a lot more involved, so I think it's best to keep this simple for now and add that support in a follow-up PR. I changed the command line to enable it by default, since that's the llama.cpp approach, but I can change it back if opt-in is preferable.
Introduces a new `--use-mmap` flag that replaces model loading I/O operations with mmap+memcpy.
In my tests, this speeds up model loading slightly, though the gain was never more than half a second. Its primary benefit right now is validating the mmap backend implementation. Later, I plan to extend this so the mapped file can serve directly as weight storage for backends that use main memory.
I used a non-default flag to be extra safe, but we could arguably follow the llama.cpp approach, with a `--no-mmap` flag to disable it instead.
I was only able to test (and build...) it under Linux, so additional testing is very welcome 🙂
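For the planned follow-up of serving weights straight from the mapping, one possible direction is the mechanism llama.cpp uses: wrapping the mapped region as a ggml CPU backend buffer. A hedged sketch; the exact header and the availability of `ggml_backend_cpu_buffer_from_ptr` in the ggml version used here are assumptions:

```cpp
#include <cstddef>
#include "ggml-backend.h"  // assumed location of the CPU buffer helpers

// Wrap an existing mmap'ed region as a non-owning CPU backend buffer, so
// tensors allocated from it point into the file pages with no memcpy.
// The mapping must stay alive as long as any tensor in this buffer is used.
ggml_backend_buffer_t wrap_mapped_weights(void* mapped_addr, std::size_t mapped_size) {
    return ggml_backend_cpu_buffer_from_ptr(mapped_addr, mapped_size);
}
```

The lifetime coupling in that comment is the "more involved" part mentioned above: the loader could no longer unmap the file after loading, since the weights would reference it for the lifetime of the model.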