fix apparent copy-paste bug in log_softmax reduced-precision fp kernel #156379
Conversation
pytorch-bot bot commented Jun 18, 2025 (edited)
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156379
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit d1f4041 with merge base 6303cc4.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
cyyever commented Jun 19, 2025
Why the kineto changes?
swolchok commented Jun 19, 2025
Whoops, I think I just forgot to update submodules when I pulled main.
swolchok commented Jun 20, 2025 (edited)
I've been stressing out trying to figure out why I can't detect this bug in testing. Apparently, this is because the actual value of the intermediate max doesn't matter for "normal" values; it's a numerical-accuracy technique called "safe softmax", and the subtracted value cancels out (see e.g. "Why Safe Softmax Doesn't Change the Result" here). I imagine we could try to contrive a very specific case where it matters, but probably we should just fix it and move on.
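Concretely, the subtracted constant cancels: for any finite constant $c$, not just the true maximum,

```math
\log\mathrm{softmax}(x)_i = (x_i - c) - \log\sum_j e^{x_j - c} = (x_i - c) - \Big(\log\sum_j e^{x_j} - c\Big) = x_i - \log\sum_j e^{x_j}.
```

Choosing $c = \max_j x_j$ only changes the floating-point range of the intermediate exponentials, not the mathematical result, which is why a wrong intermediate max is invisible for well-scaled inputs.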
```cpp
max_fvec0 = fVec::blendv(max_fvec0, data_fvec0, data_fvec0 > max_fvec0);
max_fvec1 = fVec::blendv(max_fvec1, data_fvec1, data_fvec1 > max_fvec1);
max_fvec0.store(input_max_data + d1);
max_fvec0.store(input_max_data + d1 + fVec::size());
```
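Presumably the fix is for the second store to write max_fvec1, matching the copy-paste pattern the PR title describes:

```cpp
max_fvec0.store(input_max_data + d1);
max_fvec1.store(input_max_data + d1 + fVec::size());  // previously stored max_fvec0 twice
```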
OMG, I even saw this before and got confused about why we were storing twice into max_fvec0! But hmm, why do we store twice into min_fvec and zero_fvec in lines 1028-1031 above?
I spent some time last night trying to understand how this kernel works (by interrogating the LLM of my choice, since I'm below average at intuitively understanding array-indexing code without lots of pictures), so I can actually answer that! Lines 1028-1031 are storing the identities for the max and sum reductions into our array of accumulators.
We're making a non-contiguous reduction vectorizable anyway by doing an array of reductions all at once. (We can note that outer_size is basically just a batch dimension and ignore it for purposes of understanding the kernel.) We slice the inner dimensions into chunks of length CHUNK_SIZE; the inner loops are doing CHUNK_SIZE reductions at once. Accordingly, they have an array of CHUNK_SIZE accumulators, which is what's getting initialized in lines 1028-1031. Since the CHUNK_SIZE dimension is contiguous, we can vectorize along it and get "parallelization" that way through vector arithmetic. The blocking/chunking is so that the dim_size x CHUNK_SIZE "vertical panel" handed to each thread fits in cache, since softmax ends up reading the data three times (once per inner loop: max, sum + log, and data - logsum - max).
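In scalar form, the structure looks roughly like this (a hypothetical sketch with illustrative names, not the actual kernel; the real code does the inner loops over `c` with at::vec::Vectorized instead of scalar arithmetic):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Each chunk runs CHUNK_SIZE independent reductions in lockstep; because the
// chunk dimension is contiguous, the inner loops over `c` are vectorizable.
constexpr int CHUNK_SIZE = 8;  // chosen so the dim_size x CHUNK_SIZE panel fits in cache

// stride: distance between successive elements along the reduced dim (>= CHUNK_SIZE)
void log_softmax_chunk(const float* input, float* output, long dim_size, long stride) {
  // Accumulators, one per lane, initialized to the reduction identities
  // (the analogue of what lines 1028-1031 do with min_fvec / zero_fvec).
  float max_acc[CHUNK_SIZE];
  float sum_acc[CHUNK_SIZE];
  std::fill(max_acc, max_acc + CHUNK_SIZE, -std::numeric_limits<float>::infinity());
  std::fill(sum_acc, sum_acc + CHUNK_SIZE, 0.0f);

  // Pass 1: running max along the (non-contiguous) reduced dimension.
  for (long d = 0; d < dim_size; ++d)
    for (int c = 0; c < CHUNK_SIZE; ++c)
      max_acc[c] = std::max(max_acc[c], input[d * stride + c]);

  // Pass 2: sum of exp(x - max), then log.
  for (long d = 0; d < dim_size; ++d)
    for (int c = 0; c < CHUNK_SIZE; ++c)
      sum_acc[c] += std::exp(input[d * stride + c] - max_acc[c]);
  for (int c = 0; c < CHUNK_SIZE; ++c)
    sum_acc[c] = std::log(sum_acc[c]);

  // Pass 3: data - max - logsum -- the third read of the panel.
  for (long d = 0; d < dim_size; ++d)
    for (int c = 0; c < CHUNK_SIZE; ++c)
      output[d * stride + c] = input[d * stride + c] - max_acc[c] - sum_acc[c];
}
```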
janeyx99 left a comment
Thanks for the explanation, though I haven't yet absorbed it all
janeyx99 commented Jun 20, 2025
@pytorchbot merge
pytorchmergebot commented Jun 20, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
This looks like a bug. Check if trying to fix it breaks existing tests; if not, will look into why no test coverage caught it.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168