Removed ROCM ifdef that governs thread count + smem parallel reduction. #149779
Conversation
pytorch-bot bot commented Mar 21, 2025 • edited
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149779
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit f485bbb with merge base 14f0cd7.
NEW FAILURE - The following job has failed:
UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
linux-foundation-easycla bot commented Mar 21, 2025 • edited
5had3z commented Mar 22, 2025
Those failures don't seem to be related to my changes? The NN test also works fine when I call directly with
I'm also curious as to the purpose of
jeffdaily commented Mar 27, 2025
@5had3z I'm attempting to rerun the failed CUDA UT job. The CUDA benchmark job failure is unrelated. Also, I forgot to trigger ROCm CI, so I've done that now. I approved your changes since they LGTM, pending CI passing.
jeffdaily commented Mar 27, 2025
@5had3z Rerunning the job got the same error. Let's try a rebase this time and see if we just happened to get an unlucky random set of inputs for that test.
jeffdaily commented Mar 27, 2025
@pytorchbot rebase
pytorchmergebot commented Mar 27, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
pytorchmergebot commented Mar 27, 2025
Successfully rebased; the branch was force-pushed from aa4e171 to f485bbb.
cyyever commented Mar 29, 2025
@pytorchmergebot merge -f "Unrelated failures"
pytorchmergebot commented Mar 29, 2025
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Removed ROCM ifdef that governs thread count + smem parallel reduction. (pytorch#149779)

#149548 fixed the arbitrarily missing parallelism for NLL loss, but it also added an arbitrary #ifdef ROCM guard around the fix, preventing its use on CUDA GPUs. There was also a problem with the way the kernel reduced the intermediate shared-memory results: thread 0 alone walked over them linearly. This has been changed to a simple parallel reduction algorithm.

Tested changes with `python3 test/test_nn.py`:

```
Ran 3551 tests in 200.554s

OK (skipped=998, expected failures=4)
```

Performance before and after, measured with the script below on an RTX 3090 (batch size on the x axis, time in seconds on the y axis). This GPU is also used for display graphics and such, so the measurements are pretty noisy, even with 100 samples.

## Before

## After ifdef removal

## After Parallel SMEM reduction

```python
import torch
from matplotlib import pyplot as plt
from torch.nn import functional as F

timing = []
batches = list(range(32, 4096, 32))
# The first batch of 32 is a warm-up pass and is dropped from the results below.
for batch in [32] + batches:
    samples = []
    for _ in range(100):
        probs = torch.rand(batch, 10).cuda()
        labels = torch.randint(0, 10, (batch,)).cuda()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        F.nll_loss(probs, labels)
        end.record()
        torch.cuda.synchronize()
        elapsed = start.elapsed_time(end)
        samples.append(elapsed)
    timing.append(sum(samples) / len(samples))
timing = timing[1:]
plt.plot(batches, timing)
plt.show()
```

Pull Request resolved: pytorch#149779
Approved by: https://github.com/jeffdaily
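To make the reduction change described above concrete, here is a minimal sketch of a tree-style shared-memory reduction, where the number of active threads halves each step instead of thread 0 summing the buffer serially. This is an illustrative standalone kernel, not the ATen NLL loss kernel touched by this PR; the kernel name, buffer layout, and power-of-two block-size assumption are mine.

```cuda
// Illustrative sketch only: a block-level sum using a tree-style
// parallel reduction over shared memory, contrasted with the old
// pattern where thread 0 alone walked the buffer linearly.
// Assumes blockDim.x is a power of two.
__global__ void block_sum_sketch(const float* in, float* out, int n) {
    extern __shared__ float sh_sum[];

    // Each thread accumulates a strided slice of the input.
    float local = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        local += in[i];
    }
    sh_sum[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step,
    // so the shared-memory combine takes O(log blockDim.x) steps
    // instead of O(blockDim.x) serial additions by thread 0.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            sh_sum[threadIdx.x] += sh_sum[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        out[blockIdx.x] = sh_sum[0];
    }
}
```

Launched as, e.g., `block_sum_sketch<<<1, 256, 256 * sizeof(float)>>>(d_in, d_out, n)` (device pointers `d_in`/`d_out` are placeholders), the shared buffer holds one partial sum per thread and the combine completes in log2(256) = 8 synchronized steps.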
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd