Removed ROCM ifdef that governs thread count + smem parallel reduction. #149779
Conversation
pytorch-bot bot commented Mar 21, 2025 • edited
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149779
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit f485bbb with merge base 14f0cd7.
NEW FAILURE - The following job has failed:
UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
linux-foundation-easycla bot commented Mar 21, 2025 • edited
5had3z commented Mar 22, 2025
Those failures don't seem to be related to my changes? The NN test also works fine when I call directly with
I'm also curious as to the purpose of
jeffdaily commented Mar 27, 2025
@5had3z I'm attempting to rerun the failed CUDA UT job. The CUDA benchmark job failure is unrelated. Also, I forgot to trigger ROCm CI, so I've done that now. I approved your changes since they LGTM, pending CI passing.
jeffdaily commented Mar 27, 2025
@5had3z Rerunning the job got the same error. Let's try a rebase this time and see if we just happened to get an unlucky random set of inputs for that test.
jeffdaily commented Mar 27, 2025
@pytorchbot rebase
pytorchmergebot commented Mar 27, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
pytorchmergebot commented Mar 27, 2025
Successfully rebased; the branch was force-pushed from aa4e171 to f485bbb.
cyyever commented Mar 29, 2025
@pytorchmergebot merge -f "Unrelated failures"
pytorchmergebot commented Mar 29, 2025
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Removed ROCM ifdef that governs thread count + smem parallel reduction. (pytorch#149779)

#149548 fixed the arbitrarily missing parallelism for NLL loss, but it also added an arbitrary #ifdef ROCM guard around the fix, preventing its use on CUDA GPUs. There was also a problem with the way the kernel reduced the intermediate shared-memory results: thread 0 alone walked over them linearly. This has been changed to a simple parallel reduction algorithm.

Tested changes with `python3 test/test_nn.py`:

```
Ran 3551 tests in 200.554s

OK (skipped=998, expected failures=4)
```

Performance before and after, measured with the script below on an RTX 3090 (batch size on the x axis, time in seconds on the y axis). This GPU is also used for display graphics and such, so the measurements are pretty noisy, even with 100 samples.

## Before

## After ifdef removal

## After Parallel SMEM reduction

```python
import torch
from matplotlib import pyplot as plt
from torch.nn import functional as F

timing = []
batches = list(range(32, 4096, 32))
# The first batch of 32 is a warm-up pass and is dropped from the results below.
for batch in [32] + batches:
    samples = []
    for _ in range(100):
        probs = torch.rand(batch, 10).cuda()
        labels = torch.randint(0, 10, (batch,)).cuda()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        F.nll_loss(probs, labels)
        end.record()
        torch.cuda.synchronize()
        elapsed = start.elapsed_time(end)
        samples.append(elapsed)
    timing.append(sum(samples) / len(samples))
timing = timing[1:]
plt.plot(batches, timing)
plt.show()
```

Pull Request resolved: pytorch#149779
Approved by: https://github.com/jeffdaily
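To make the reduction change described above concrete, here is a minimal sketch of a tree-style shared-memory reduction, where the number of active threads halves each step instead of thread 0 summing the buffer serially. This is an illustrative standalone kernel, not the ATen NLL loss kernel touched by this PR; the kernel name, buffer layout, and power-of-two block-size assumption are mine.

```cuda
// Illustrative sketch only: a block-level sum using a tree-style
// parallel reduction over shared memory, contrasted with the old
// pattern where thread 0 alone walked the buffer linearly.
// Assumes blockDim.x is a power of two.
__global__ void block_sum_sketch(const float* in, float* out, int n) {
    extern __shared__ float sh_sum[];

    // Each thread accumulates a strided slice of the input.
    float local = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        local += in[i];
    }
    sh_sum[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step,
    // so the shared-memory combine takes O(log blockDim.x) steps
    // instead of O(blockDim.x) serial additions by thread 0.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            sh_sum[threadIdx.x] += sh_sum[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        out[blockIdx.x] = sh_sum[0];
    }
}
```

Launched as, e.g., `block_sum_sketch<<<1, 256, 256 * sizeof(float)>>>(d_in, d_out, n)` (device pointers `d_in`/`d_out` are placeholders), the shared buffer holds one partial sum per thread and the combine completes in log2(256) = 8 synchronized steps.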
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd