Perl / perl5

Faster string reversal when source/dest buffers are distinct #23012


Merged

Conversation

@richardleach (Contributor)

pp_reverse() has traditionally operated on a single string buffer,
reversing it in place.

When pp_reverse() takes a single SV argument whose buffer cannot be
swiped by sv_setsv_flags(), a straight copy of the buffer is taken
and then that copy is reversed. In such cases, two complete passes over
the entire string are needed to reverse a non-UTF8 string, and reversing
a UTF8 string takes three complete passes.

This PR enables pp_reverse() to apply the "swipe" test itself
and, for straightforward copy cases, to avoid calling sv_setsv_flags().
Instead, it does the work of preparing the TARG SV and reverse copies
the string (UTF8 or otherwise) in a single pass.
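For illustration only, here is a minimal standalone sketch of the single-pass idea for the UTF8 case. This is not the pp.c code (which works on SV buffers and presumably uses Perl's own UTF8 macros such as UTF8SKIP); it just shows the shape of the algorithm: characters are emitted in reverse order while the bytes within each character keep their order.

    #include <string.h>
    #include <stddef.h>

    /* Minimal sketch, not the pp.c implementation: reverse the characters of a
     * well-formed UTF-8 buffer into a distinct destination in one pass. */
    static void
    reverse_copy_utf8(char *dst, const char *src, size_t len)
    {
        const char *s    = src;
        const char *send = src + len;
        char       *d    = dst + len;      /* fill the destination backwards */

        while (s < send) {
            const char *ch = s++;
            /* Skip continuation bytes (10xxxxxx) to find the character end. */
            while (s < send && ((unsigned char)*s & 0xC0) == 0x80)
                s++;
            d -= (size_t)(s - ch);
            memcpy(d, ch, (size_t)(s - ch)); /* keep byte order inside the char */
        }
    }

One pass over the source, writing directly into the target buffer, is the property the PR relies on for both the UTF8 and non-UTF8 cases.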

Non-UTF8 performance is more than doubled. UTF8 performance is improved
by ~40%. (This is using gcc version 12.2.0 on Linux, running on a midrange Zen3 CPU.)

  • my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse $x }

PATCHED

              2,300.42 msec task-clock                #    0.994 CPUs utilized
                     2      context-switches          #    0.869 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,472      page-faults               #    1.075 K/sec
        10,523,888,668      cycles                    #    4.575 GHz
             3,407,581      stalled-cycles-frontend   #    0.03% frontend cycles idle
            17,098,625      stalled-cycles-backend    #    0.16% backend cycles idle
        61,520,603,431      instructions              #    5.85  insn per cycle
        10,255,033,319      branches                  #    4.458 G/sec

BLEAD

              5,693.74 msec task-clock                #    0.996 CPUs utilized
                     4      context-switches          #    0.703 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,471      page-faults               #  433.985 /sec
        26,417,106,165      cycles                    #    4.640 GHz
             4,123,476      stalled-cycles-frontend   #    0.02% frontend cycles idle
            16,843,163      stalled-cycles-backend    #    0.06% backend cycles idle
        41,985,580,059      instructions              #    1.59  insn per cycle
         5,210,998,998      branches                  #  915.215 M/sec
  • my $x = "\x{263A}\x{263A}x\x{263A}y\x{263A}"x(1024*1000); my $y; for (0..1_000) { $y = reverse $x }

PATCHED

             13,473.52 msec task-clock                #    0.999 CPUs utilized
                     8      context-switches          #    0.594 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,869      page-faults               #  212.936 /sec
        64,319,489,828      cycles                    #    4.774 GHz
            19,171,494      stalled-cycles-frontend   #    0.03% frontend cycles idle
            29,132,509      stalled-cycles-backend    #    0.05% backend cycles idle
       139,440,302,209      instructions              #    2.17  insn per cycle
        32,809,474,409      branches                  #    2.435 G/sec

BLEAD

             21,833.47 msec task-clock                #    0.999 CPUs utilized
                    18      context-switches          #    0.824 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,362      page-faults               #  108.183 /sec
        99,685,066,602      cycles                    #    4.566 GHz
            17,824,958      stalled-cycles-frontend   #    0.02% frontend cycles idle
            34,272,539      stalled-cycles-backend    #    0.03% backend cycles idle
       378,833,407,708      instructions              #    3.80  insn per cycle
        72,845,191,845      branches                  #    3.336 G/sec

  • This set of changes requires a perldelta entry, and it will be included shortly.

@richardleach (Contributor, Author)

Note: I have a separate branch tinkering with (1) minor variations on the current algorithms and (2) the portable code mentioned in #20692. There will likely be a subsequent PR for at least the byte-reversal algorithm.

I don't know how to significantly improve the UTF8 algorithms - any suggestions welcome!

richardleach force-pushed the hydahy/reverse_lesscopy branch 3 times, most recently from f1b0c04 to 54f0528 on February 21, 2025 at 12:53
@richardleach (Contributor, Author)

I added a branch for the single-byte UTF8 case while squashing bugs. gcc timings on that benchmark:

  • my $x = "\x{263A}\x{263A}x\x{263A}y\x{263A}"x(1024*1000); my $y; for (0..1_000) { $y = reverse $x }
    Blead was 99,685,066,602 cycles.
    PR was 64,319,489,828 cycles.
    PR now 52,251,356,819 cycles:

             11,450.39 msec task-clock                #    0.997 CPUs utilized
                     8      context-switches          #    0.699 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,873      page-faults               #  250.908 /sec
        52,251,356,819      cycles                    #    4.563 GHz
             8,948,516      stalled-cycles-frontend   #    0.02% frontend cycles idle
            26,727,744      stalled-cycles-backend    #    0.05% backend cycles idle
       145,588,067,775      instructions              #    2.79  insn per cycle
        32,808,946,621      branches                  #    2.865 G/sec

Compiling with clang-11, the "before" times are both about the same as with gcc, as mentioned in the original PR message. However, clang does a lot better with the new copy algorithms: it manages to vectorize where gcc (including later versions up to trunk) does not manage to do so.

  • my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse $x }
    Blead used about 26,417,106,165 cycles.
    gcc used about 10,523,888,668 cycles.
    clang uses about 3,073,025,623 cycles:

                699.47 msec task-clock                #    0.924 CPUs utilized
                     0      context-switches          #    0.000 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,456      page-faults               #    3.511 K/sec
         3,073,025,623      cycles                    #    4.393 GHz
             1,397,171      stalled-cycles-frontend   #    0.05% frontend cycles idle
            15,385,164      stalled-cycles-backend    #    0.50% backend cycles idle
         8,986,055,265      instructions              #    2.92  insn per cycle
           324,660,921      branches                  #  464.155 M/sec
  • my $x = "\x{263A}\x{263A}x\x{263A}y\x{263A}"x(1024*1000); my $y; for (0..1_000) { $y = reverse $x }
    Blead used about 99,685,066,602 cycles.
    gcc used about 52,251,356,819 cycles.
    clang uses about 33,720,311,746 cycles:

              7,474.72 msec task-clock                #    0.997 CPUs utilized
                    13      context-switches          #    1.739 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,346      page-faults               #  313.858 /sec
        33,720,311,746      cycles                    #    4.511 GHz
            11,169,631      stalled-cycles-frontend   #    0.03% frontend cycles idle
            25,214,673      stalled-cycles-backend    #    0.07% backend cycles idle
       182,481,212,299      instructions              #    5.41  insn per cycle
        36,907,239,014      branches                  #    4.938 G/sec

@thesamesam commented Feb 23, 2025 (edited)

It manages to vectorize where gcc (including later versions up to trunk) does not manage to do so.

Please do consider filing a bug with preprocessed sources to GCC for that (even better with a simple testcase; just a function or function(s) which clearly don't vectorise well compared to Clang in godbolt is fine; no need for it to be benchmarkable). Thanks.

@richardleach (Contributor, Author)

It manages to vectorize where gcc (including later versions up to trunk) does not manage to do so.

Please do consider filing a bug with preprocessed sources to GCC for that (even better with a simple testcase; just a function or function(s) which clearly don't vectorise well compared to Clang in godbolt is fine; no need for it to be benchmarkable). Thanks.

https://godbolt.org/z/jz6fxKKhh probably captures this. A simple reverse loop: gcc outputs simple code, but clang goes to town.

    #include <stddef.h>

    void reverse(char * outp, const char * inps, const char * inpe) {
        while (inpe != inps)
            *outp++ = *--inpe;
    }

I got the impression that alias checking under gcc is a known problem and is covered by a metabug, but I can't find the link to it right now. I didn't fancy wading through all the linked issues to check for duplicates at this time. Feel free to report this case though if you believe it to be useful to the gcc folks.

I've got a follow-up commit planned that would give gcc and clang similar performance for string reversal in Perl.


@tonycoz (Contributor)

-O3 and a semi-recent -march (skylake, x86-64-v2) lets gcc vectorize this.


@richardleach (Contributor, Author)

-O3 and a semi-recent -march (skylake, x86-64-v2) lets gcc vectorize this.

I gather that building for x86-64-v2 is gaining traction, so that's pretty cool. Are any major distros building with gcc at -O3 though?

@xenu (Member) commented Mar 3, 2025

If those buffers are distinct, the restrict keyword is the way to go.

@tonycoz (Contributor)

If those buffers are distinct, the restrict keyword is the way to go.

restrict on outp for the godbolt example doesn't enable the vectorization; it just eliminates the fallback to the direct implementation on overlap.

@xenu (Member) commented Mar 3, 2025

It does in GCC trunk, and indeed, according to the WIP GCC 15 release notes, it is supposed to vectorize more aggressively at -O2.

@richardleach (Contributor, Author)

If those buffers are distinct, the restrict keyword is the way to go.

I had that in originally, but it seemed to break g++ builds?

@thesamesam

It should be okay to add it conditionally: either a PERL_RESTRICT macro defined to restrict for non-C++ builds, or __restrict__ (which is fine for g++ at least).
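A minimal sketch of that suggestion; PERL_RESTRICT is a hypothetical name here, not an existing macro in perl.h:

    /* Hypothetical PERL_RESTRICT macro, as suggested above: plain C99 restrict
     * when compiling as C, __restrict__ when the compiler is C++ (g++/clang++). */
    #ifndef PERL_RESTRICT
    #  ifdef __cplusplus
    #    define PERL_RESTRICT __restrict__
    #  else
    #    define PERL_RESTRICT restrict
    #  endif
    #endif

    /* Example use on the reverse-copy loop from the godbolt link above. */
    void reverse(char * PERL_RESTRICT outp, const char *inps, const char *inpe) {
        while (inpe != inps)
            *outp++ = *--inpe;
    }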

@richardleach (Contributor, Author)

Given where we are in the dev cycle, are there any objections to me merging this PR as-is, to at least get this level of improvement into v42? (Subject to any other review comments you have on it.)

It seems like there are a bunch of things to look at, and doing so (including smoking time) could run us into the hard freeze.

  • Using restrict: I guess we would indeed want a new macro for this?
  • The following does get non-UTF8 performance for both gcc & clang down to around 400 msec task-clock without needing any specific build options. However, I'm unclear on the portability, on performance on platforms where unaligned access is expensive, and on whether this could be slower when/if compilers vectorize with AVX-512 instructions.
rich@perlporting:~/github/perl5$ git diff 1fbc4c4a1693bc080338fac9b92f1a59d97df499 9c8f640ab644a800ce258d669dbaf6399060dda1
diff --git a/pp.c b/pp.c
index 4f0513cc99..3795dd5ab0 100644
--- a/pp.c
+++ b/pp.c
@@ -6526,6 +6526,35 @@ PP(pp_unshift)
     return NORMAL;
 }
 
+/* Swap the byte order of a 64-bit integer */
+/* https://dev.to/wunk/fast-array-reversal-with-simd-j3p */
+static inline uint64_t S_swap64(uint64_t x)
+{
+    return ((x & 0x00000000000000FFULL) << 56) |
+           ((x & 0x000000000000FF00ULL) << 40) |
+           ((x & 0x0000000000FF0000ULL) << 24) |
+           ((x & 0x00000000FF000000ULL) <<  8) |
+           ((x & 0x000000FF00000000ULL) >>  8) |
+           ((x & 0x0000FF0000000000ULL) >> 24) |
+           ((x & 0x00FF000000000000ULL) >> 40) |
+           ((x & 0xFF00000000000000ULL) >> 56);
+}
+
 PP_wrapped(pp_reverse, 0, 1)
 {
@@ -6680,10 +6709,38 @@ PP_wrapped(pp_reverse, 0, 1)
                         }
                     }
                 } else {
+                    STRLEN i = 0;
+                    STRLEN j = len;
                     char * outp= SvPVX(TARG);
-                    const char *p = src + len;
-                    while (p != src)
-                        *outp++ = *--p;
+
+                    /* Where possible, operate on many bytes at once.
+                     * This should have generally good performance, but
+                     * also stands a good chance of being optimised into
+                     * bswap instructions by optimising c compilers. */
+                    while (j - i >= 16) {
+                        /* Reverse copy 8 bytes from the front of src to the back of outp */
+                        *(uint64_t *)(outp + i)
+                                = S_swap64( *(uint64_t *)(src + j - 8) );
+                        /* Do the same from the back of src to the front of outp */
+                        *(uint64_t *)(outp + j - 8)
+                                = S_swap64( *(uint64_t *)(src + i) );
+                        i += 8;
+                        j -= 8;
+                    }
+
+                    /* We could also reverse copy 4 bytes, then 2 bytes
+                     * at a time. But those loops will execute only a
+                     * few times at most. Is it worth the bloat? */
+
+                    /* On 32-bit platforms, should we use S_swap32
+                     * instead of S_swap64? */
+
+                    /* Swap any remaining bytes one by one. */
+                    while (i < j) {
+                        outp[i] = src[j - 1];
+                        outp[j - 1] = src[i];
+                        i++; j--;
+                    }
                 }
                 RETURN;
             }
  • Should/could we support both of the above, maybe using #ifdefs based off build flags, or even doing something more adventurous?

@tonycoz (Contributor)

  • performance on platforms where unaligned access is expensive

Unaligned access is undefined behaviour in C; UBSAN will complain about it if nothing else.

Under the hood the compiler can generate code that relies on the platform or special instructions that support unaligned access, just as it does when clang/gcc vectorizes the simple loop.
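One common way to avoid that undefined behaviour in the source itself (a sketch, not something proposed in this PR) is to do the 8-byte loads and stores through memcpy; compilers typically lower a small fixed-size memcpy to a single, possibly unaligned, load or store on platforms that permit it:

    #include <stdint.h>
    #include <string.h>

    /* S_swap64 as in the diff quoted above. */
    static inline uint64_t S_swap64(uint64_t x)
    {
        return ((x & 0x00000000000000FFULL) << 56) | ((x & 0x000000000000FF00ULL) << 40) |
               ((x & 0x0000000000FF0000ULL) << 24) | ((x & 0x00000000FF000000ULL) <<  8) |
               ((x & 0x000000FF00000000ULL) >>  8) | ((x & 0x0000FF0000000000ULL) >> 24) |
               ((x & 0x00FF000000000000ULL) >> 40) | ((x & 0xFF00000000000000ULL) >> 56);
    }

    /* Byte-swapping load/store helpers that make no alignment assumptions. */
    static inline uint64_t load_swapped64(const char *p)
    {
        uint64_t x;
        memcpy(&x, p, sizeof x);   /* safe for any alignment of p */
        return S_swap64(x);
    }

    static inline void store64(char *p, uint64_t x)
    {
        memcpy(p, &x, sizeof x);
    }

The bulk loop in the diff could then use store64(outp + i, load_swapped64(src + j - 8)) and the like, rather than casting through uint64_t pointers.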


pp.c (Outdated)

    if (DO_UTF8(src_sv)) {
        SvUTF8_on(TARG);

        const U8* s = (U8*)src;
@tonycoz (Contributor)

This would be better as:

const U8* s = (const U8*)src;

which avoids casting away const.

Similarly below for send.

@richardleach (Contributor, Author)

Thanks, I'll make those two changes.

@tonycoz (Contributor)

The main thing I'm hesitant about with this change is hoisting the SvPV_can_swipe_buf() out into a macro, and one where three of the parameters are simple fields of one of the other parameters.

I think it could be a simple function in sv.c taking only the ssv and flags parameters.

If this is defined before Perl_sv_setsv_flags it may even get inlined there.

(other opinions welcome here)

@richardleach (Contributor, Author)

I think it could be a simple function in sv.c taking only the ssv and flags parameters.

Without specifying any explicit optimization settings, gcc does not inline it, but clang does. Feeding -O3 to gcc doesn't change the result.

I suppose the check could be pulled out into a macro in sv.c, then used by both Perl_sv_setsv_flags and a new Perl_sv_can_swipe_buf function?

@tonycoz (Contributor)

I suppose the check could be pulled out into a macro in sv.c, then used by both Perl_sv_setsv_flags and a new Perl_sv_can_swipe_buf function?

Works for me
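A rough sketch of the arrangement being agreed here; the condition and signature below are illustrative stand-ins, not the actual sv.c logic:

    /* Illustrative only: one macro holds the "can we swipe the buffer?" test,
     * used both inline in the hot Perl_sv_setsv_flags path and by a small
     * standalone function that pp_reverse can call. The condition shown is a
     * simplified placeholder; the real test has more cases (COW state, buffer
     * length checks, etc.). */
    #define S_SvPV_can_swipe_buf(ssv, sflags)            \
        (   !((sflags) & SV_NOSTEAL)                     \
         && SvTEMP(ssv)                                  \
         && SvREFCNT(ssv) == 1                           \
         && !SvROK(ssv))

    bool
    Perl_sv_can_swipe_pv_buf(pTHX_ SV *ssv, U32 sflags)
    {
        return cBOOL(S_SvPV_can_swipe_buf(ssv, sflags));
    }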

…_flags

Perl_sv_setsv_flags contains the canonical logic for determining the best
method for assigning the string value from a source SV to a destination SV:

  * "Swipe" the string buffer from the source SV
  * COW the source SV's string buffer
  * Do a full copy

This commit extracts the "can we swipe the buffer" tests out into a new
macro (`S_SvPV_can_swipe_buf`) within sv.c. It has two users:

  * Perl_sv_setsv_flags - so that the logic remains inline in this hot code
  * Perl_sv_can_swipe_pv_buf - a new function

`pp_reverse` will shortly make use of the new function to avoid unnecessary
string copies when doing a reversal in scalar context.
This could help detect implementation bugs in algorithms for reverse copying from a source buffer to a different destination buffer.
`pp_reverse()` has traditionally operated on a single string buffer,
reversing it in place.

When `pp_reverse()` takes a single SV argument, whose buffer cannot be
swiped by `sv_setsv_flags()`, a straight copy of the buffer is taken
and then that is reversed. In such cases, two complete passes over the
entire string are needed to reverse a non-UTF8 string, and reversing a
UTF8 string takes three complete passes.

This commit enables `pp_reverse()` to apply the "swipe" test itself
and, for straightforward copy cases, to avoid calling `sv_setsv_flags()`.
Instead, it does the work of preparing the `TARG` SV and reverse copies
the string (UTF8 or otherwise) in a single pass.

The performance improvement will vary by compiler and CPU. With gcc 12.2.0
on Linux running on "znver3" hardware, performance for both UTF8 and
non-UTF8 strings approximately doubled. Using clang-11 instead on the same
machine gave a fivefold improvement for the non-UTF8 case.
@richardleach (Contributor, Author)

I suppose the check could be pulled out into a macro in sv.c, then used by both Perl_sv_setsv_flags and a new Perl_sv_can_swipe_buf function?

Works for me

Now implemented like that.

richardleach merged commit e792078 into Perl:blead on Mar 12, 2025.
richardleach deleted the hydahy/reverse_lesscopy branch on March 12, 2025 at 09:25.
Reviewers: tonycoz (approved these changes)
Assignees: no one assigned
Labels: none yet
Projects: none yet
Milestone: no milestone
Participants (4): @richardleach, @thesamesam, @tonycoz, @xenu
