[ROCm] NLLLoss (torch.nll_loss) Performance Tuning by Dynamically Selecting # of GPU threads #149548
Conversation
pytorch-bot commented Mar 19, 2025 (edited)
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149548
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 0827d28 with merge base ffa0853.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
jeffdaily left a comment
approved, pending clean CI
apakbin commented Mar 20, 2025
The benchmark we ran:
apakbin commented Mar 20, 2025
@pytorchbot rebase
pytorchmergebot commented Mar 20, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
pytorchmergebot commented Mar 20, 2025
Successfully rebased 6e8dc92 to bd13475.
To add the ciflow label: this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
jeffdaily commented Mar 21, 2025
@pytorchbot rebase
pytorchmergebot commented Mar 21, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
…amp(2^round(log2(dim0/16)), min = 32, max = 1024)
pytorchmergebot commented Mar 21, 2025
Successfully rebased bd13475 to 0827d28.
jeffdaily commented Mar 21, 2025
@pytorchbot merge
pytorchmergebot commented Mar 21, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ecting # of GPU threads (#149548)
Instead of fixing the number of GPU threads to 32 regardless of input size, this PR dynamically selects the number of threads based on the formula: clamp(2^round(log2(dim0/16)), min = 32, max = 1024). The experiments below were done on an MI300 machine for data type float32:
Pull Request resolved: #149548
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
…n. (#149779)
#149548 fixed the arbitrarily missing parallelism for NLL, but it also added an arbitrary #ifdef ROCM guard around this fix to prevent its use on CUDA GPUs. There is also a problem with the way the kernel does the reduction from the intermediate shared memory, using only thread 0 walking linearly. This has been changed to a simple parallel reduction algorithm.
Tested changes with `python3 test/test_nn.py`:
```
Ran 3551 tests in 200.554s
OK (skipped=998, expected failures=4)
```
Performance before and after with the script below on an RTX 3090, batch size on the x axis, time (sec) on the y axis. This GPU is also used for display graphics and such, so the measurements are pretty noisy, even with 100 samples.
## Before
## After ifdef removal
## After Parallel SMEM reduction
```python
import torch
from matplotlib import pyplot as plt
from torch.nn import functional as F

timing = []
batches = list(range(32, 4096, 32))
for batch in [32] + batches:
    samples = []
    for _ in range(100):
        probs = torch.rand(batch, 10).cuda()
        labels = torch.randint(0, 10, (batch,)).cuda()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        F.nll_loss(probs, labels)
        end.record()
        torch.cuda.synchronize()
        elapsed = start.elapsed_time(end)
        samples.append(elapsed)
    timing.append(sum(samples) / len(samples))
timing = timing[1:]
plt.plot(batches, timing)
plt.show()
```
Pull Request resolved: #149779
Approved by: https://github.com/jeffdaily
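For readers unfamiliar with the reduction change in that commit: one standard way to replace a single thread walking the shared-memory buffer linearly is a stride-halving tree reduction. The Python below is only a host-side simulation of that indexing pattern, not the actual kernel; the function name `tree_reduce` is ours, `partial` stands in for the intermediate shared-memory buffer, and it assumes the buffer length is a power of two (as CUDA block sizes are).

```python
def tree_reduce(partial):
    # Simulate a stride-halving shared-memory reduction on the host.
    # Each iteration halves the number of active "threads": thread tid
    # accumulates its partner at tid + stride, with a barrier
    # (__syncthreads() in the kernel) between iterations.
    buf = list(partial)
    stride = len(buf) // 2
    while stride > 0:
        for tid in range(stride):  # these additions run in parallel on a GPU
            buf[tid] += buf[tid + stride]
        stride //= 2
    return buf[0]
```

The point of the stride-halving order is that the partial sums are combined in log2(n) dependent steps, each executed in parallel, instead of the n sequential additions performed when thread 0 walks the buffer alone.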
Instead of fixing the number of GPU threads to 32 regardless of input size, this PR dynamically selects the number of threads based on the formula: clamp(2^round(log2(dim0/16)), min = 32, max = 1024). The experiments below were done on an MI300 machine for data type float32:
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
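As a minimal sketch of the thread-selection rule quoted in the description above: the helper name `nll_loss_num_threads` is ours for illustration, and the actual selection happens in the kernel launch code, not in Python.

```python
import math

def nll_loss_num_threads(dim0: int) -> int:
    # clamp(2^round(log2(dim0 / 16)), min = 32, max = 1024)
    raw = 2 ** round(math.log2(dim0 / 16))
    return int(max(32, min(1024, raw)))
```

For example, dim0 = 4096 gives 2^round(log2(256)) = 256 threads, while small inputs fall back to the previous fixed value of 32 and very large ones cap at 1024.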