vulkan: Warptile tuning for Intel Xe2/Xe3#18178


Open
virajwad wants to merge 5 commits into ggml-org:master from virajwad:xe3_perf_changes

Conversation

@virajwad

Summary
In this PR we modify the m_warptile configurations to increase prompt processing performance for Intel Xe2+ GPUs that support the coopmat extension.

Changes

  1. Increase WGSIZE by 4x.
  2. Decrease the number of workgroups dispatched for MM shaders by 2x in the x-dim and 2x in the y-dim (a 4x total decrease).
  3. Increase BM and BN each by 2x.
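Changes (2) and (3) are two sides of the same coin: once the output tile (BM x BN) doubles in each dimension, the dispatch grid for a given matrix shrinks by 2x per dimension. A minimal sketch of that arithmetic, using illustrative names and sizes (not the actual ggml-vulkan variables):

```python
# Workgroup count for an M x N matmul output covered by BM x BN tiles.
# Illustrative sketch only; names and sizes are not the real ggml-vulkan code.
def num_workgroups(M, N, BM, BN):
    # One workgroup per output tile, rounding up at the edges.
    return ((M + BM - 1) // BM, (N + BN - 1) // BN)

# Old medium tile (64 x 64) vs. the tuned tile (128 x 128), 4096 x 4096 output:
old_x, old_y = num_workgroups(4096, 4096, 64, 64)    # 64 x 64 workgroups
new_x, new_y = num_workgroups(4096, 4096, 128, 128)  # 32 x 32 workgroups

# 2x fewer in x and 2x fewer in y: 4x fewer workgroups overall.
assert old_x == 2 * new_x and old_y == 2 * new_y
```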

Accuracy Check
Basic testing with llama-cli across the models that show a perf jump (and with different prompt sizes): model output looks correct/reasonable.

Unit tests Check
Checked on a system with an Arc B580 + Intel integrated graphics. All unit tests pass.

[image: unit test results]

Performance Results
The command run is llama-bench.exe -m ..\<model> -p 512 -r 5 -n 128
The eval token gen results don't change and weren't expected to, only prompt processing :) The numbers below show prompt processing in tok/s for the Arc B580 and the Lunar Lake Series 2 iGPU.

[images: prompt processing results for Arc B580 and Lunar Lake iGPU]

PR Status
Ready for Review

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 18, 2025
if ((device->vendor_id == VK_VENDOR_ID_INTEL) && (device->driver_id == vk::DriverId::eIntelProprietaryWindows)) {
    if (device->coopmat_support && device->architecture == INTEL_XE2) {
        // Xe2/Xe3 with coopmat enabled - warptile performance tuning
        m_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
@jeffbolznv (Collaborator)
I wonder if this should actually be the large tile size?

Also, a quick Google search suggests Xe2 has 64KB of register file per core, which with 512 invocations is only 32 registers each, which seems very low. But I've never worked on this hardware, so I'm just speculating.

@virajwad (Author) Dec 18, 2025 (edited)

Hi @jeffbolznv, sure, I can look at re-enabling the large warptile size for Intel here and then moving the warptile config from m_ to l_. I'll also check perf again after the change.

Are you doing (64 * 1024) / 512 invocations = 128 bytes per invocation, with the assumption of a 4-byte-wide register (to get 32 registers per invocation)?

@jeffbolznv (Collaborator)

Yes, that's the calculation I did.

@virajwad (Author)

Thanks Jeff. For the Xe architecture, each register in the GRF is 32 bytes wide. But I need to look into the register situation a bit deeper.
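For reference, the register-budget estimate traded in this thread works out as follows. This is a sketch of the arithmetic only: the 64 KiB register file per core is the reviewer's figure from a search, and the 32-byte GRF width is from the comment above; neither number is verified against hardware documentation here.

```python
# Back-of-the-envelope register budget per invocation, per the review thread.
REG_FILE_BYTES = 64 * 1024   # assumed register file per Xe2 core (reviewer's figure)
INVOCATIONS    = 512         # the new workgroup size in this PR

bytes_per_invocation = REG_FILE_BYTES // INVOCATIONS   # 128 bytes
regs_4_byte  = bytes_per_invocation // 4               # 32, assuming 4-byte registers
regs_32_byte = bytes_per_invocation // 32              # 4, assuming 32-byte GRF registers
print(bytes_per_invocation, regs_4_byte, regs_32_byte)
```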

@jeffbolznv (Collaborator)

CC @mmerecki in case this makes sense to also enable for Linux.

m_align = 64;
s_align = 32;

if ((device->vendor_id == VK_VENDOR_ID_INTEL) && (device->driver_id == vk::DriverId::eIntelProprietaryWindows)) {
@netrunnereve (Collaborator)

Ah it's good to see more tunes show up 😃. Please move your tune to line 2845 so that they're all placed together.

@virajwad (Author)

Hi @netrunnereve, thanks! I can do that, but the wg_denoms change also needs to overwrite the default and be 'paired' with this tuning config to pass unit tests. Do you want me to make two separate 'if' statements with the same conditions?

@netrunnereve (Collaborator)

I would just move this section above line 2841.

        l_mmq_wg_denoms = l_wg_denoms = { 128, 128, 1 };
        m_mmq_wg_denoms = m_wg_denoms = {  64,  64, 1 };
        s_mmq_wg_denoms = s_wg_denoms = {  32,  32, 1 };
        l_align = 128;
        m_align =  64;
        s_align =  32;

@virajwad (Author)

Made the change

@netrunnereve (Collaborator)

Oh, and please run a before and after ./bin/test-backend-ops perf -o MUL_MAT -p "n=512" to make sure that all quants actually run faster after the tuning.


@virajwad (Author) commented Dec 18, 2025 (edited)

Oh and please run a before and after ./bin/test-backend-ops perf -o MUL_MAT -p "n=512" to make sure that all quants actually run faster after tuning.

Thanks! I checked on the Arc B580; everything saw a good improvement (except for the type_a=bf16 test, which had the same perf).

[image: test-backend-ops perf results]

My iGPU had the same perf on all quants, as it doesn't support coopmat.


Reviewers

@jeffbolznv left review comments
@0cc4m awaiting requested review (code owner)
@netrunnereve left review comments

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Labels

ggml (changes relating to the ggml tensor library for machine learning), Vulkan (issues specific to the Vulkan backend)

Projects

None yet

Milestone

No milestone


3 participants

@virajwad, @jeffbolznv, @netrunnereve
