feat: support mmap for model loading #1059
base: master
Conversation
Green-Sky commented Dec 7, 2025
How much value would there be in llama.cpp exporting the mmap stuff as a library?
wbruna commented Dec 9, 2025
I don't think it'd help that much right now. The mmap part itself is more-or-less straightforward; replacing the current alloc+memcpy code with a buffer managed externally will be much trickier.
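A minimal sketch of that "straightforward" part, assuming POSIX/Linux (an illustration only, not this PR's actual code; the names `mapped_file`, `map_model_file`, and `copy_tensor_data` are made up here):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct mapped_file {
    const std::uint8_t* data = nullptr;
    std::size_t         size = 0;
};

// Map the whole model file read-only. Error handling kept minimal.
static bool map_model_file(const char* path, mapped_file& out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }
    void* p = mmap(nullptr, (std::size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping holds its own reference to the file
    if (p == MAP_FAILED) return false;
    out.data = (const std::uint8_t*)p;
    out.size = (std::size_t)st.st_size;
    return true;
}

// What the mmap+memcpy load path boils down to: a read() at file offset
// `off` into a caller-owned buffer becomes a plain memcpy from the mapping.
static void copy_tensor_data(const mapped_file& mf, std::size_t off,
                             void* dst, std::size_t n) {
    std::memcpy(dst, mf.data + off, n);
}
```

The tricky part wbruna points at is the second step: getting rid of that memcpy entirely means the destination buffer is no longer owned by the loader, which touches buffer lifetime and backend allocation code.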
valkarias commented Dec 10, 2025 • edited
Have you experimented with mmap'ing and then copying to the GPU?
wbruna commented Dec 10, 2025
Not yet. Right now I'm just reusing the I/O buffer; adding a separate code path to deliver the mapped area directly to the backend, just to avoid a memcpy, sounded like too much change for too little potential gain.
The behavior you describe sounds odd, though. At least on Linux, large dynamically-allocated memory areas are backed by mmap anyway, so they should behave the same. Maybe it's a difference between file-backed and anonymous mappings.
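To illustrate the distinction being drawn here, a tiny sketch of the two mapping kinds (assuming Linux/glibc behavior, where malloc services large allocations via anonymous mmap internally):

```cpp
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s <model-file>\n", argv[0]); return 1; }

    // Anonymous mapping: zero-filled private pages with no file behind them.
    // This is what a large malloc() typically turns into on Linux.
    const std::size_t big = (std::size_t)1 << 30;
    void* anon = mmap(nullptr, big, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // File-backed mapping: pages are populated from the page cache of the
    // file, which is what this PR's loading path uses.
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return 1; }
    void* filemap = mmap(nullptr, (std::size_t)st.st_size, PROT_READ,
                         MAP_PRIVATE, fd, 0);

    if (anon != MAP_FAILED) munmap(anon, big);
    if (filemap != MAP_FAILED) munmap(filemap, (std::size_t)st.st_size);
    close(fd);
    return 0;
}
```

Both kinds are mmap regions as far as the kernel is concerned, but they differ in where page faults are served from (zero pages vs. the page cache), which could plausibly explain behavior differences when copying to the GPU.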
wbruna commented Dec 14, 2025
Using the memory-mapped area directly for the tensor weights seems a lot more involved, so I think it's best to keep this simple for now and add that support in a follow-up PR. I changed the command line to enable it by default, since that's the llama.cpp approach, but I can change it back if opt-in is preferable.
Introduces a new `--use-mmap` flag that replaces model loading I/O operations with mmap+memcpy.
In my tests, this speeds up model loading slightly, though the gain was never more than half a second. Its primary benefit right now is validating the mmap backend implementation. Later, I plan to extend this so the mapped file can serve directly as weight storage for backends that use main memory.
I used a non-default flag to be extra safe, but we could arguably follow the llama.cpp approach, with a `--no-mmap` flag to disable it instead.
I was only able to test (and build...) it under Linux, so additional testing is very welcome 🙂
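For the planned follow-up of serving weights straight from the mapping, one possible direction is the mechanism llama.cpp uses: wrapping the mapped region as a ggml CPU backend buffer. A hedged sketch; the exact header and the availability of `ggml_backend_cpu_buffer_from_ptr` in the ggml version used here are assumptions:

```cpp
#include <cstddef>
#include "ggml-backend.h"  // assumed location of the CPU buffer helpers

// Wrap an existing mmap'ed region as a non-owning CPU backend buffer, so
// tensors allocated from it point into the file pages with no memcpy.
// The mapping must stay alive as long as any tensor in this buffer is used.
ggml_backend_buffer_t wrap_mapped_weights(void* mapped_addr, std::size_t mapped_size) {
    return ggml_backend_cpu_buffer_from_ptr(mapped_addr, mapped_size);
}
```

The lifetime coupling in that comment is the "more involved" part mentioned above: the loader could no longer unmap the file after loading, since the weights would reference it for the lifetime of the model.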