
Add PagedAttention support (experimental, CUDA only)#17579


Open

ericcurtin wants to merge 1 commit into ggml-org:master from ericcurtin:add-pagedattention

Conversation

@ericcurtin (Collaborator) commented Nov 28, 2025 (edited)
Implement PagedAttention algorithm for memory-efficient KV cache management. This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging) and enables efficient memory sharing between sequences through copy-on-write semantics.

The implementation is experimental and disabled by default. Enable it with the `--pagedattention` flag.
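To illustrate the idea described above (this is a hypothetical sketch, not the PR's actual code): a paged KV cache keeps a per-sequence "block table" mapping logical block indices to physical blocks, much like a virtual-memory page table, and shares blocks between sequences via reference counts, copying a block only when a writer would touch a shared one. The names `PagedKVCache`, `slot_for`, `fork_seq`, and `make_writable` are invented for this example.

```cpp
#include <cassert>
#include <vector>

// Minimal sketch of paged KV cache bookkeeping (metadata only, no tensor data).
struct PagedKVCache {
    static constexpr int kBlockSize = 16;             // tokens per block
    std::vector<int>              ref_count;          // per physical block
    std::vector<std::vector<int>> block_table;        // per sequence: logical -> physical
    std::vector<int>              free_blocks;

    explicit PagedKVCache(int n_blocks) : ref_count(n_blocks, 0) {
        for (int i = n_blocks - 1; i >= 0; --i) free_blocks.push_back(i);
    }

    int new_seq() { block_table.emplace_back(); return (int) block_table.size() - 1; }

    int alloc_block() {
        assert(!free_blocks.empty());
        int b = free_blocks.back(); free_blocks.pop_back();
        ref_count[b] = 1;
        return b;
    }

    // Map a token position to a physical slot, allocating a block on demand.
    int slot_for(int seq, int pos) {
        int logical = pos / kBlockSize;
        auto & bt = block_table[seq];
        while ((int) bt.size() <= logical) bt.push_back(alloc_block());
        return bt[logical] * kBlockSize + pos % kBlockSize;
    }

    // Fork a sequence (e.g. parallel sampling): share all blocks, bump refcounts.
    int fork_seq(int seq) {
        int child = new_seq();
        block_table[child] = block_table[seq];
        for (int b : block_table[child]) ref_count[b]++;
        return child;
    }

    // Copy-on-write: before writing into a shared block, give the writer a
    // private copy so the other sequences keep seeing the old contents.
    void make_writable(int seq, int logical) {
        int b = block_table[seq][logical];
        if (ref_count[b] > 1) {
            ref_count[b]--;
            block_table[seq][logical] = alloc_block();
            // (real code would also copy the K/V data of the old block here)
        }
    }
};
```

Because whole sequences share physical blocks until divergence, forking N samplers from one prompt costs no extra KV memory until each sampler writes its first new token, which is where the fragmentation and sharing benefits claimed above come from.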

@ericcurtin force-pushed the add-pagedattention branch 3 times, most recently from 2a33486 to 14ad291 on November 28, 2025 19:58
@github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 28, 2025
```cpp
const int token_idx = block_idx * BLOCK_SIZE + i;
if (token_idx >= seq_len) break;

// TODO: Vectorized K loading and Q·K computation
```
Collaborator

Some of the TODOs look quite sus; I'm wondering if the code is AI-generated and/or whether this function actually works.

Besides, you should probably give some credit to the original kernel: https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cuh

Collaborator (Author)

I mark it experimental for good reason 🙂

@ngxson (Collaborator) commented Nov 28, 2025 (edited)

I think it's important to explicitly state whether or not you used AI to generate this PR. The numerous TODOs throughout the PR do make it look sus; there will be a human who spends real time and effort reviewing this PR, after all.

Contributor

> I mark it experimental for good reason 🙂

I think this PR should be marked as a draft until it is no longer experimental.

Commit: Implement PagedAttention algorithm for memory-efficient KV cache management. This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging) and enables efficient memory sharing between sequences through copy-on-write semantics. The implementation is experimental and disabled by default. Enable with the --pagedattention flag.

Signed-off-by: Eric Curtin <eric.curtin@docker.com>

Reviewers

@ngxson left review comments
@ddh0 left review comments
Awaiting requested review from @CISC (code owner) and @ggerganov (code owner)

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (issues specific to Nvidia GPUs)

Projects

None yet

Milestone

No milestone

3 participants: @ericcurtin, @ngxson, @ddh0
