Add PagedAttention support (experimental, CUDA only) #17579
Conversation
Force-pushed: 2a33486 to 14ad291, 14ad291 to 06254d1, 06254d1 to 1745418

```cuda
const int token_idx = block_idx * BLOCK_SIZE + i;
if (token_idx >= seq_len) break;
// TODO: Vectorized K loading and Q·K computation
```
some TODOs look quite sus, I'm wondering if the code is AI-generated and/or if this function actually works

besides, probably give some credit to the original kernel: https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cuh
I mark it experimental for good reason 🙂
I think it's important to explicitly state whether you used AI to generate this PR or not. The numerous TODOs throughout the PR do make it look sus. There will be a human who spends real time and effort reviewing this PR, after all.
I mark it experimental for good reason 🙂
I think this PR should be marked as a draft, until it is no longer experimental
Implement PagedAttention algorithm for memory-efficient KV cache management. This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging) and enables efficient memory sharing between sequences through copy-on-write semantics.

The implementation is experimental and disabled by default. Enable with the --pagedattention flag.

Signed-off-by: Eric Curtin <eric.curtin@docker.com>

Force-pushed: 1745418 to f0b133d
Implement PagedAttention algorithm for memory-efficient KV cache management. This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging) and enables efficient memory sharing between sequences through copy-on-write semantics.
The implementation is experimental and disabled by default. Enable with the --pagedattention flag.