Add PagedAttention support (experimental, CUDA only) #17579
Conversation
Force-pushed: 2a33486 to 14ad291, 14ad291 to 06254d1, 06254d1 to 1745418

```cuda
const int token_idx = block_idx * BLOCK_SIZE + i;
if (token_idx >= seq_len) break;
// TODO: Vectorized K loading and Q·K computation
```
some TODOs look quite sus, I'm wondering if the code is AI-generated and/or if this function actually works

besides, probably give some credit to the original kernel: https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cuh
I mark it experimental for good reason 🙂
I think it's important to explicitly state whether you used AI to generate this PR or not. The numerous TODOs throughout the PR do make it look sus. There will be a human who spends real time and effort reviewing this PR, after all.
I mark it experimental for good reason 🙂
I think this PR should be marked as a draft, until it is no longer experimental
Implement PagedAttention algorithm for memory-efficient KV cache management. This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging) and enables efficient memory sharing between sequences through copy-on-write semantics.

The implementation is experimental and disabled by default. Enable with the --pagedattention flag.

Signed-off-by: Eric Curtin <eric.curtin@docker.com>

Force-pushed: 1745418 to f0b133d
Implement PagedAttention algorithm for memory-efficient KV cache management. This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging) and enables efficient memory sharing between sequences through copy-on-write semantics.
The implementation is experimental and disabled by default. Enable with the --pagedattention flag.