Draft of segmented reduce optimization #578

Open

gevtushenko wants to merge 7 commits into NVIDIA:main

from gevtushenko:enh-main/github/segmented_reduce

Collaborator

gevtushenko commented Sep 30, 2022

This PR applies a technique similar to the one used in the segmented sort algorithm. Segments are partitioned into categories, and a different kind of thread group processes each category. While optimizing segmented reduction, I introduced a warp reduce agent and generalized the reduce agent implementation. Below are speedups for small segment sizes; the best speedup is about 66x:

Medium-size segments experience minor slowdowns, but this can be addressed by further tuning:

Large segments are not affected by the optimization:
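The partitioning idea described above can be illustrated with a minimal host-side sketch. It is not the code from this PR: the thresholds `kSmallMax` and `kMediumMax`, the `SegmentGroups` struct, and all function names are hypothetical, and the per-category loops run sequentially here, whereas the PR assigns GPU thread groups of different sizes to each category.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical thresholds; the PR does not state the actual cut-offs.
constexpr std::size_t kSmallMax  = 32;   // assumed: fits a sub-warp or warp
constexpr std::size_t kMediumMax = 1024; // assumed: fits a single thread block

struct SegmentGroups {
  std::vector<int> small, medium, large; // segment ids per size category
};

// Classify each segment by its length.
// `offsets` holds num_segments + 1 entries; segment s spans [offsets[s], offsets[s + 1]).
SegmentGroups partition_segments(const std::vector<int>& offsets) {
  SegmentGroups groups;
  for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
    const std::size_t len = static_cast<std::size_t>(offsets[s + 1] - offsets[s]);
    std::vector<int>& bucket = len <= kSmallMax
                                   ? groups.small
                                   : (len <= kMediumMax ? groups.medium : groups.large);
    bucket.push_back(static_cast<int>(s));
  }
  return groups;
}

// Sum of a single segment; stands in for whichever agent handles its category.
int reduce_segment(const std::vector<int>& data, const std::vector<int>& offsets, int s) {
  return std::accumulate(data.begin() + offsets[s], data.begin() + offsets[s + 1], 0);
}

// Segmented sum: partition first, then process each category separately.
std::vector<int> segmented_sum(const std::vector<int>& data, const std::vector<int>& offsets) {
  std::vector<int> out(offsets.empty() ? 0 : offsets.size() - 1);
  const SegmentGroups groups = partition_segments(offsets);
  auto process = [&](const std::vector<int>& ids) {
    for (int s : ids) out[s] = reduce_segment(data, offsets, s);
  };
  process(groups.small);  // on a GPU: several segments per warp (assumption)
  process(groups.medium); // on a GPU: one warp or block per segment (assumption)
  process(groups.large);  // on a GPU: one block per segment (assumption)
  return out;
}
```

Calling `segmented_sum` with data `{1, 2, 3, 4, 5, 6}` and offsets `{0, 2, 2, 6}` produces one sum per segment, including a zero for the empty middle segment. The point of the split is that each category can be tuned independently.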

In the commits, there's an attempt to fuse the reduction of small segments with the partitioning stage. This optimization doesn't perform as well. My guess is that it slows down the decoupled look-back in the partitioning stage or affects its occupancy, which leads to an overall slowdown.

To avoid breaking stream capture (if it is in use), I added a separate check for it. We might need to check stream capturing mode in our tests later.

gevtushenko added 5 commits

September 30, 2022 21:22

Optimize segmented reduce

224f433

Fuse partitioning and small segments processing

8239e36

Revert partitioning and small segments processing fusion

2a45a70

This reverts commit 8239e36.

Don't query stream capture if there's not enough segments

82dd606

Fix temporary storage names

0121c2e

gevtushenko added the P2: nice to have (Desired, but not necessary.) label

Sep 30, 2022

gevtushenko added 2 commits

October 9, 2022 04:21

Optimize large segments

84c02eb

Revert large segments optimization

556f139

This reverts commit 84c02eb.

Collaborator Author

gevtushenko commented Oct 9, 2022

I experimented with a deterministic version of the large segments optimization, assigning a number of CTAs to each segment. The optimization is quite expensive in terms of memory: it requires about num_segments * (sizeof(int) + 4 * sizeof(AccumulatorT)) bytes. The speedup disappears as soon as there are about 16 large segments (the exact number depends on the number of SMs), so I don't think it's worth it. Just in case, I pushed and then reverted the optimization.
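To put the memory estimate in concrete terms, here is a small sketch that evaluates the expression from the comment for a chosen accumulator type. The helper name `large_segment_temp_bytes` is hypothetical and does not come from the PR; it only restates the quoted formula.

```cpp
#include <cstddef>

// Approximate extra temporary storage in bytes, following the estimate
// quoted in the comment: num_segments * (sizeof(int) + 4 * sizeof(AccumulatorT)).
// Hypothetical helper, for illustration only.
template <typename AccumulatorT>
constexpr std::size_t large_segment_temp_bytes(std::size_t num_segments) {
  return num_segments * (sizeof(int) + 4 * sizeof(AccumulatorT));
}
```

For example, on a platform with a 4-byte `int` and an 8-byte `double`, `large_segment_temp_bytes<double>(1 << 20)` evaluates to 1048576 * (4 + 32) = 37748736 bytes, or 36 MiB, for about a million segments.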