vulkan: Warptile tuning for Intel Xe2/Xe3#18178


Open
virajwad wants to merge 5 commits into ggml-org:master from virajwad:xe3_perf_changes

Conversation

@virajwad

Summary
In this PR we modify the m_warptile configurations to increase prompt processing performance for Intel Xe2+ GPUs that support the coopmat extension.

Changes

  1. Increase WGSIZE by 4x.
  2. Decrease the number of workgroups dispatched for MM shaders by 2x in the x-dim and 2x in the y-dim (a 4x total decrease).
  3. Increase BM and BN each by 2x.
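Changes (2) and (3) are two sides of the same coin: once the output tile (BM x BN) doubles in each dimension, the dispatch grid for a given matrix shrinks by 2x per dimension. A minimal sketch of that arithmetic, using illustrative names and sizes (not the actual ggml-vulkan variables):

```python
# Workgroup count for an M x N matmul output covered by BM x BN tiles.
# Illustrative sketch only; names and sizes are not the real ggml-vulkan code.
def num_workgroups(M, N, BM, BN):
    # One workgroup per output tile, rounding up at the edges.
    return ((M + BM - 1) // BM, (N + BN - 1) // BN)

# Old medium tile (64 x 64) vs. the tuned tile (128 x 128), 4096 x 4096 output:
old_x, old_y = num_workgroups(4096, 4096, 64, 64)    # 64 x 64 workgroups
new_x, new_y = num_workgroups(4096, 4096, 128, 128)  # 32 x 32 workgroups

# 2x fewer in x and 2x fewer in y: 4x fewer workgroups overall.
assert old_x == 2 * new_x and old_y == 2 * new_y
```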

Accuracy Check
Basic testing with llama-cli across the models that show a perf jump (and with different prompt sizes): model output looks correct/reasonable.

Unit tests Check
Checked on a system with an Arc B580 + Intel integrated graphics. All unit tests pass.

[image: unit test results]

Performance Results
The command run is llama-bench.exe -m ..\<model> -p 512 -r 5 -n 128
The eval token gen results don't change and weren't expected to, only prompt processing :) The numbers below show prompt processing in tok/s for the Arc B580 and the Lunar Lake Series 2 iGPU.

[images: prompt processing results for Arc B580 and Lunar Lake iGPU]

PR Status
Ready for Review

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 18, 2025
if ((device->vendor_id == VK_VENDOR_ID_INTEL) && (device->driver_id == vk::DriverId::eIntelProprietaryWindows)) {
    if (device->coopmat_support && device->architecture == INTEL_XE2) {
        // Xe2/Xe3 with coopmat enabled - warptile performance tuning
        m_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
@jeffbolznv (Collaborator)
I wonder if this should actually be the large tile size?

Also, a quick Google search suggests Xe2 has 64KB of register file per core, which with 512 invocations is only 32 registers each, which seems very low. But I've never worked on this hardware, so I'm just speculating.

@virajwad (Author) Dec 18, 2025 (edited)

Hi @jeffbolznv, sure, I can look at re-enabling the large warptile size for Intel here and then moving the warptile config from m_ to l_. I'll also check perf again after the change.

Are you doing (64 * 1024) / 512 invocations = 128 bytes per invocation, with the assumption of a 4-byte-wide register (to get 32 registers per invocation)?

@jeffbolznv (Collaborator)

Yes, that's the calculation I did.

@virajwad (Author)

Thanks Jeff. For the Xe architecture, each register in the GRF is 32 bytes wide. But I need to look into the register situation a bit deeper.
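For reference, the register-budget estimate traded in this thread works out as follows. This is a sketch of the arithmetic only: the 64 KiB register file per core is the reviewer's figure from a search, and the 32-byte GRF width is from the comment above; neither number is verified against hardware documentation here.

```python
# Back-of-the-envelope register budget per invocation, per the review thread.
REG_FILE_BYTES = 64 * 1024   # assumed register file per Xe2 core (reviewer's figure)
INVOCATIONS    = 512         # the new workgroup size in this PR

bytes_per_invocation = REG_FILE_BYTES // INVOCATIONS   # 128 bytes
regs_4_byte  = bytes_per_invocation // 4               # 32, assuming 4-byte registers
regs_32_byte = bytes_per_invocation // 32              # 4, assuming 32-byte GRF registers
print(bytes_per_invocation, regs_4_byte, regs_32_byte)
```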

@jeffbolznv (Collaborator)

CC @mmerecki in case this makes sense to also enable for Linux.

m_align = 64;
s_align = 32;

if ((device->vendor_id == VK_VENDOR_ID_INTEL) && (device->driver_id == vk::DriverId::eIntelProprietaryWindows)) {
@netrunnereve (Collaborator)

Ah it's good to see more tunes show up 😃. Please move your tune to line 2845 so that they're all placed together.

@virajwad (Author)

Hi @netrunnereve, thanks! I can do that, but the wg_denoms change also needs to overwrite the default and be 'paired' with this tuning config to pass unit tests. Do you want me to make two separate 'if' statements with the same conditions?

@netrunnereve (Collaborator)

I would just move this section above line 2841.

        l_mmq_wg_denoms = l_wg_denoms = { 128, 128, 1 };
        m_mmq_wg_denoms = m_wg_denoms = {  64,  64, 1 };
        s_mmq_wg_denoms = s_wg_denoms = {  32,  32, 1 };
        l_align = 128;
        m_align =  64;
        s_align =  32;

@virajwad (Author)

Made the change

@netrunnereve (Collaborator)

Oh, and please run a before and after ./bin/test-backend-ops perf -o MUL_MAT -p "n=512" to make sure that all quants actually run faster after the tuning.


@virajwad (Author) commented Dec 18, 2025 (edited)

Oh and please run a before and after ./bin/test-backend-ops perf -o MUL_MAT -p "n=512" to make sure that all quants actually run faster after tuning.

Thanks! I checked on the Arc B580; everything saw a good improvement (except for the type_a=bf16 test, which had the same perf).

[image: test-backend-ops perf results]

My iGPU had the same perf on all quants, as it doesn't support coopmat.


Reviewers

@jeffbolznv left review comments
@0cc4m awaiting requested review (code owner)
@netrunnereve left review comments

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Labels

ggml (changes relating to the ggml tensor library for machine learning), Vulkan (issues specific to the Vulkan backend)

Projects

None yet

Milestone

No milestone


3 participants

@virajwad, @jeffbolznv, @netrunnereve
