feat: support mmap for model loading #1059


Open
wbruna wants to merge 1 commit into leejet:master from wbruna:sd_mmap_io

Conversation

@wbruna
Contributor

Introduces a new --use-mmap flag that replaces model loading I/O operations with mmap + memcpy.

In my tests, this improves model loading speed slightly, though the gain was never more than half a second. Its primary benefit right now is validating the mmap backend implementation. Later, I plan to extend this so the mapped file can serve directly as weight storage for backends that use main memory.

I used a non-default flag to be extra safe, but we could arguably follow the llama.cpp approach, with a --no-mmap flag to disable it instead.

I was only able to test (and build...) it under Linux, so additional testing is very welcome 🙂
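
For context, here is a minimal POSIX sketch of the kind of substitution described above (my own illustration, not the PR's code): map the model file read-only and memcpy a byte range into an existing destination buffer instead of read()-ing it.

```cpp
// Minimal POSIX sketch (illustration only, not this PR's code): map a model file
// and copy one tensor's bytes into a pre-allocated destination buffer.
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

bool copy_range_from_mapped_file(const char* path, size_t offset, size_t size, void* dst) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }

    // Map the whole file read-only; pages are faulted in lazily by the kernel.
    void* base = mmap(nullptr, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after closing the descriptor
    if (base == MAP_FAILED) return false;

    bool ok = offset + size <= (size_t) st.st_size;
    if (ok) {
        memcpy(dst, (const char*) base + offset, size);  // replaces the read() call
    }
    munmap(base, (size_t) st.st_size);
    return ok;
}
```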

@Green-Sky
Contributor

How much value would it be if llama.cpp exported the mmap stuff as a library?

@wbruna
Contributor Author

How much value would it be if llama.cpp exported the mmap stuff as a library?

I don't think it'd help that much right now. The mmap part itself is more or less straightforward; replacing the current alloc+memcpy code with an externally managed buffer will be much trickier.
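
To illustrate the contrast (a rough sketch with made-up helper names, not code from either project): the copy path is backend-agnostic, while the zero-copy path ties tensor lifetimes to the mapping and only works for host-memory backends.

```cpp
// Rough sketch with hypothetical names, only to illustrate the trade-off above.
#include <cstddef>
#include <cstring>

struct MappedFile { const char* base; size_t size; };  // hypothetical wrapper around mmap()

// Current approach: the backend owns its buffer, and loading just copies into it.
void load_tensor_copy(void* backend_buf, const MappedFile& mf, size_t off, size_t n) {
    memcpy(backend_buf, mf.base + off, n);
}

// Zero-copy approach: the tensor data would point into the mapping itself, so the
// mapping must stay alive for every use of the tensor, and the memory is
// read-only and host-resident (main-memory backends only).
const void* tensor_data_zero_copy(const MappedFile& mf, size_t off) {
    return mf.base + off;
}
```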


@valkarias

valkarias commented Dec 10, 2025 (edited)

Have you experimented with mmapping and then copying to the GPU?
In my experience, I've restricted mmapping to CPU inference and loading only; mmap -> copy to GPU became a bottleneck for some reason (page size, potentially?).

@wbruna
Contributor Author

Have you experimented with mmapping and then copying to the GPU? In my experience, I've restricted mmapping to CPU inference and loading only; mmap -> copy to GPU became a bottleneck for some reason (page size, potentially?).

Not yet. Right now I'm just reusing the I/O buffer; adding a separate code path to deliver the mapped area directly to the backend just to avoid a memcpy sounded like too much change for too little potential gain.

That behavior you describe sounds... odd. At least on Linux, large dynamically allocated memory areas are backed by mmap anyway, so they should behave the same. Maybe it's a difference between file-backed and anonymous mappings.
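
One difference between the two is that anonymous pages are already resident once written, while file-backed pages are faulted in on first touch. A hedged sketch of how one could probe that (plain POSIX madvise hints before the upload; I've only considered the Linux case):

```cpp
// Sketch: hint the kernel before copying a file-backed mapping to the GPU, so the
// upload is less likely to stall on on-demand page faults.
#include <cstddef>
#include <sys/mman.h>

void prefault_hints(void* base, size_t size) {
    madvise(base, size, MADV_SEQUENTIAL);  // aggressive read-ahead for a linear copy
    madvise(base, size, MADV_WILLNEED);    // start paging the range in now
}
```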

@wbruna
Copy link
ContributorAuthor

The direct use of the memory-mapped area for the tensor weights seems a lot more involved, so I think it's best to keep this simple for now and add that support in a follow-up PR.

I changed the command-line option to enable mmap by default, since that's the llama.cpp approach, but I can change it back if opt-in is preferable.
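
For reference, a sketch of what the follow-up could look like, assuming ggml's ggml_backend_cpu_buffer_from_ptr helper from ggml-backend.h is used to wrap the mapping (this is not part of this PR, and the real integration in sd.cpp would differ):

```cpp
// Sketch of the follow-up idea (not in this PR): wrap the mapped region as a
// host-memory backend buffer so tensors can use it directly, with no alloc/memcpy.
// Assumes ggml_backend_cpu_buffer_from_ptr(); the mapping must outlive the buffer.
#include "ggml-backend.h"

ggml_backend_buffer_t weights_buffer_from_mapping(void* mapped_base, size_t mapped_size) {
    return ggml_backend_cpu_buffer_from_ptr(mapped_base, mapped_size);
}
```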

wbruna marked this pull request as ready for review December 14, 2025 21:11