This repository was archived by the owner on Mar 21, 2024. It is now read-only.
/cub (Public archive)

Draft of segmented reduce optimization #578

Open
gevtushenko wants to merge 7 commits into NVIDIA:main from gevtushenko:enh-main/github/segmented_reduce

Conversation

gevtushenko
Collaborator

This PR applies a technique similar to the one used in the segmented sort algorithm. Segments are partitioned into categories by size, and a different thread group is applied to each segment category. While optimizing segmented reduction, I introduced a warp reduce agent and generalized the reduce agent implementation. Below are speedups for small segment sizes; the best speedup is about 66x:
[Figure: speedups for small segments]

Medium-size segments experience minor slowdowns, but this can be addressed by further tuning:
[Figure: medium segments]

Large-size segments are not affected by the optimization:
[Figure: large segments]
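
To illustrate the partitioning idea described at the top of this PR, here is a minimal host-side C++ sketch. It is not the code from the commits: the thresholds `kSmallMax` and `kMediumMax`, the `SegmentCategory` enum, and the `partition_segments` helper are all hypothetical names and values chosen for illustration. The sketch only shows how segment indices could be split into size categories so that each category can later be handled by a differently sized thread group (for example a few threads, a warp, or a whole block).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical thresholds for illustration only; the PR does not state
// the actual values used to separate segment categories.
constexpr int kSmallMax  = 32;    // assumed upper bound for "small" segments
constexpr int kMediumMax = 1024;  // assumed upper bound for "medium" segments

enum class SegmentCategory { Small, Medium, Large };

// Segment indices grouped by category.
struct PartitionedSegments {
  std::vector<int> small;
  std::vector<int> medium;
  std::vector<int> large;
};

inline SegmentCategory classify(int segment_size) {
  if (segment_size <= kSmallMax)  return SegmentCategory::Small;
  if (segment_size <= kMediumMax) return SegmentCategory::Medium;
  return SegmentCategory::Large;
}

// `offsets` holds num_segments + 1 entries; segment i covers
// the half-open range [offsets[i], offsets[i + 1]).
inline PartitionedSegments partition_segments(const std::vector<int>& offsets) {
  assert(!offsets.empty());
  PartitionedSegments result;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    const int size = offsets[i + 1] - offsets[i];
    const int index = static_cast<int>(i);
    switch (classify(size)) {
      case SegmentCategory::Small:  result.small.push_back(index);  break;
      case SegmentCategory::Medium: result.medium.push_back(index); break;
      case SegmentCategory::Large:  result.large.push_back(index);  break;
    }
  }
  return result;
}
```

On the device, this classification would be done by a partitioning pass rather than a serial loop; the sketch only shows the categorization logic under the assumed thresholds.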

In the commits, there's an attempt to fuse the reduction of small segments with the partitioning stage. This optimization doesn't perform as well. My guess is that it slows down the decoupled look-back at the partitioning stage or affects its occupancy, which leads to an overall slowdown.

To avoid breaking stream capture (if it is in use), I added a separate check for it. We might need to cover stream capture mode in our tests later.

ogreen reacted with thumbs up emoji; ogreen and miscco reacted with hooray emoji
gevtushenko added the "P2: nice to have" (Desired, but not necessary.) label on Sep 30, 2022
gevtushenko
Collaborator, Author

Experimented with a deterministic version of the large-segment optimization, assigning a number of CTAs to each segment. The optimization is quite expensive in terms of memory and requires about `num_segments * (sizeof(int) + 4 * sizeof(AccumulatorT))` bytes. The speedup disappears as soon as there are about 16 large segments (the particular number depends on the number of SMs), so I don't think it's worth it. Just in case, I pushed and then reverted the mentioned optimization.
[Figure: large segments]
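
To make the memory estimate above concrete, here is a small C++ sketch that evaluates the quoted formula for a given accumulator type. The helper name `deterministic_large_segment_bytes` is hypothetical and is not part of CUB; it simply restates `num_segments * (sizeof(int) + 4 * sizeof(AccumulatorT))`.

```cpp
#include <cstddef>

// Hypothetical helper (not a CUB API): approximate extra temporary storage,
// in bytes, for the deterministic large-segment variant, using the formula
// quoted in the comment above.
template <typename AccumulatorT>
constexpr std::size_t deterministic_large_segment_bytes(std::size_t num_segments) {
  return num_segments * (sizeof(int) + 4 * sizeof(AccumulatorT));
}
```

For example, on a platform with 4-byte `int` and 8-byte `double`, `deterministic_large_segment_bytes<double>(1000000)` comes to 36,000,000 bytes, i.e. roughly 36 MB of extra storage for one million segments.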
