Perl / perl5

Faster string reversal when source/dest buffers are distinct #23012


Merged

Conversation

@richardleach (Contributor)

pp_reverse() has traditionally operated on a single string buffer,
reversing it in place.

When pp_reverse() takes a single SV argument whose buffer cannot be
swiped by sv_setsv_flags(), a straight copy of the buffer is taken
and then that copy is reversed. In such cases, two complete passes over
the entire string are needed to reverse a non-UTF8 string, and reversing
a UTF8 string takes three complete passes.

This PR enables pp_reverse() to apply the "swipe" test itself
and, for straightforward copy cases, to avoid calling sv_setsv_flags().
Instead, it does the work of preparing the TARG SV and reverse copies
the string (UTF8 or otherwise) in a single pass.
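For illustration only, here is a minimal standalone sketch of the single-pass idea for the UTF8 case. This is not the pp.c code (which works on SV buffers and presumably uses Perl's own UTF8 macros such as UTF8SKIP); it just shows the shape of the algorithm: characters are emitted in reverse order while the bytes within each character keep their order.

    #include <string.h>
    #include <stddef.h>

    /* Minimal sketch, not the pp.c implementation: reverse the characters of a
     * well-formed UTF-8 buffer into a distinct destination in one pass. */
    static void
    reverse_copy_utf8(char *dst, const char *src, size_t len)
    {
        const char *s    = src;
        const char *send = src + len;
        char       *d    = dst + len;      /* fill the destination backwards */

        while (s < send) {
            const char *ch = s++;
            /* Skip continuation bytes (10xxxxxx) to find the character end. */
            while (s < send && ((unsigned char)*s & 0xC0) == 0x80)
                s++;
            d -= (size_t)(s - ch);
            memcpy(d, ch, (size_t)(s - ch)); /* keep byte order inside the char */
        }
    }

One pass over the source, writing directly into the target buffer, is the property the PR relies on for both the UTF8 and non-UTF8 cases.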

Non-UTF8 performance is more than doubled. UTF8 performance is improved
by ~40%. (This is using gcc version 12.2.0 on Linux, running on a midrange Zen3 CPU.)

  • my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse $x }

PATCHED

              2,300.42 msec task-clock                #    0.994 CPUs utilized
                     2      context-switches          #    0.869 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,472      page-faults               #    1.075 K/sec
        10,523,888,668      cycles                    #    4.575 GHz
             3,407,581      stalled-cycles-frontend   #    0.03% frontend cycles idle
            17,098,625      stalled-cycles-backend    #    0.16% backend cycles idle
        61,520,603,431      instructions              #    5.85  insn per cycle
        10,255,033,319      branches                  #    4.458 G/sec

BLEAD

              5,693.74 msec task-clock                #    0.996 CPUs utilized
                     4      context-switches          #    0.703 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,471      page-faults               #  433.985 /sec
        26,417,106,165      cycles                    #    4.640 GHz
             4,123,476      stalled-cycles-frontend   #    0.02% frontend cycles idle
            16,843,163      stalled-cycles-backend    #    0.06% backend cycles idle
        41,985,580,059      instructions              #    1.59  insn per cycle
         5,210,998,998      branches                  #  915.215 M/sec
  • my $x = "\x{263A}\x{263A}x\x{263A}y\x{263A}"x(1024*1000); my $y; for (0..1_000) { $y = reverse $x }

PATCHED

             13,473.52 msec task-clock                #    0.999 CPUs utilized
                     8      context-switches          #    0.594 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,869      page-faults               #  212.936 /sec
        64,319,489,828      cycles                    #    4.774 GHz
            19,171,494      stalled-cycles-frontend   #    0.03% frontend cycles idle
            29,132,509      stalled-cycles-backend    #    0.05% backend cycles idle
       139,440,302,209      instructions              #    2.17  insn per cycle
        32,809,474,409      branches                  #    2.435 G/sec

BLEAD

             21,833.47 msec task-clock                #    0.999 CPUs utilized
                    18      context-switches          #    0.824 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,362      page-faults               #  108.183 /sec
        99,685,066,602      cycles                    #    4.566 GHz
            17,824,958      stalled-cycles-frontend   #    0.02% frontend cycles idle
            34,272,539      stalled-cycles-backend    #    0.03% backend cycles idle
       378,833,407,708      instructions              #    3.80  insn per cycle
        72,845,191,845      branches                  #    3.336 G/sec

  • This set of changes requires a perldelta entry, and it will be included shortly.

@richardleach (Contributor, Author)

Note: I have a separate branch tinkering with (1) minor variations on the current algorithms and (2) the portable code mentioned in #20692. There will likely be a subsequent PR for at least the byte-reversal algorithm.

I don't know how to significantly improve the UTF8 algorithms - any suggestions welcome!

richardleach force-pushed the hydahy/reverse_lesscopy branch 3 times, most recently from f1b0c04 to 54f0528 on February 21, 2025 at 12:53
@richardleach (Contributor, Author)

I added a branch for the single-byte UTF8 case while squashing bugs. gcc timings on that benchmark:

  • my $x = "\x{263A}\x{263A}x\x{263A}y\x{263A}"x(1024*1000); my $y; for (0..1_000) { $y = reverse $x }
    Blead was 99,685,066,602 cycles.
    PR was 64,319,489,828 cycles.
    PR now 52,251,356,819 cycles:

             11,450.39 msec task-clock                #    0.997 CPUs utilized
                     8      context-switches          #    0.699 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,873      page-faults               #  250.908 /sec
        52,251,356,819      cycles                    #    4.563 GHz
             8,948,516      stalled-cycles-frontend   #    0.02% frontend cycles idle
            26,727,744      stalled-cycles-backend    #    0.05% backend cycles idle
       145,588,067,775      instructions              #    2.79  insn per cycle
        32,808,946,621      branches                  #    2.865 G/sec

Compiling with clang-11, the "before" times are both about the same as with gcc, as mentioned in the original PR message. However, clang does a lot better with the new copy algorithms: it manages to vectorize where gcc (including later versions up to trunk) does not manage to do so.

  • my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse $x }
    Blead used about 26,417,106,165 cycles.
    gcc used about 10,523,888,668 cycles.
    clang uses about 3,073,025,623 cycles:

                699.47 msec task-clock                #    0.924 CPUs utilized
                     0      context-switches          #    0.000 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,456      page-faults               #    3.511 K/sec
         3,073,025,623      cycles                    #    4.393 GHz
             1,397,171      stalled-cycles-frontend   #    0.05% frontend cycles idle
            15,385,164      stalled-cycles-backend    #    0.50% backend cycles idle
         8,986,055,265      instructions              #    2.92  insn per cycle
           324,660,921      branches                  #  464.155 M/sec
  • my $x = "\x{263A}\x{263A}x\x{263A}y\x{263A}"x(1024*1000); my $y; for (0..1_000) { $y = reverse $x }
    Blead used about 99,685,066,602 cycles.
    gcc used about 52,251,356,819 cycles.
    clang uses about 33,720,311,746 cycles:

              7,474.72 msec task-clock                #    0.997 CPUs utilized
                    13      context-switches          #    1.739 /sec
                     0      cpu-migrations            #    0.000 /sec
                 2,346      page-faults               #  313.858 /sec
        33,720,311,746      cycles                    #    4.511 GHz
            11,169,631      stalled-cycles-frontend   #    0.03% frontend cycles idle
            25,214,673      stalled-cycles-backend    #    0.07% backend cycles idle
       182,481,212,299      instructions              #    5.41  insn per cycle
        36,907,239,014      branches                  #    4.938 G/sec

@thesamesam commented Feb 23, 2025 (edited)

It manages to vectorize where gcc (including later versions up to trunk) does not manage to do so.

Please do consider filing a bug with preprocessed sources to GCC for that (even better with a simple testcase; just a function or function(s) which clearly don't vectorise well compared to Clang in godbolt is fine; no need for it to be benchmarkable). Thanks.

@richardleach (Contributor, Author)

It manages to vectorize where gcc (including later versions up to trunk) does not manage to do so.

Please do consider filing a bug with preprocessed sources to GCC for that (even better with a simple testcase; just a function or function(s) which clearly don't vectorise well compared to Clang in godbolt is fine; no need for it to be benchmarkable). Thanks.

https://godbolt.org/z/jz6fxKKhh probably captures this. A simple reverse loop: gcc outputs simple code, but clang goes to town.

    #include <stddef.h>

    void reverse(char * outp, const char * inps, const char * inpe) {
        while (inpe != inps)
            *outp++ = *--inpe;
    }

I got the impression that alias checking under gcc is a known problem and is covered by a metabug, but I can't find the link to it right now. I didn't fancy wading through all the linked issues to check for duplicates at this time. Feel free to report this case though if you believe it to be useful to the gcc folks.

I've got a follow-up commit planned that would give gcc and clang similar performance for string reversal in Perl.


@tonycoz (Contributor)

-O3 and a semi-recent -march (skylake, x86-64-v2) lets gcc vectorize this.


@richardleach (Contributor, Author)

-O3 and a semi-recent -march (skylake, x86-64-v2) lets gcc vectorize this.

I gather that building for x86-64-v2 is gaining traction, so that's pretty cool. Are any major distros building with gcc at -O3 though?

@xenu (Member) commented Mar 3, 2025

If those buffers are distinct, the restrict keyword is the way to go.

@tonycoz (Contributor)

If those buffers are distinct, the restrict keyword is the way to go.

restrict on outp for the godbolt example doesn't enable the vectorization; it just eliminates the fallback to the direct implementation on overlap.

@xenu (Member) commented Mar 3, 2025

It does in GCC trunk, and indeed, according to the WIP GCC 15 release notes, it is supposed to vectorize more aggressively at -O2.

@richardleach (Contributor, Author)

If those buffers are distinct, the restrict keyword is the way to go.

I had that in originally, but it seemed to break g++ builds?

@thesamesam

It should be okay to add it conditionally: either a PERL_RESTRICT macro defined to restrict for non-C++ builds, or __restrict__ (which is fine for g++ at least).
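A minimal sketch of that suggestion; PERL_RESTRICT is a hypothetical name here, not an existing macro in perl.h:

    /* Hypothetical PERL_RESTRICT macro, as suggested above: plain C99 restrict
     * when compiling as C, __restrict__ when the compiler is C++ (g++/clang++). */
    #ifndef PERL_RESTRICT
    #  ifdef __cplusplus
    #    define PERL_RESTRICT __restrict__
    #  else
    #    define PERL_RESTRICT restrict
    #  endif
    #endif

    /* Example use on the reverse-copy loop from the godbolt link above. */
    void reverse(char * PERL_RESTRICT outp, const char *inps, const char *inpe) {
        while (inpe != inps)
            *outp++ = *--inpe;
    }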

@richardleach (Contributor, Author)

Given where we are in the dev cycle, are there any objections to me merging this PR as-is, to at least get this level of improvement into v42? (Subject to any other review comments you have on it.)

It seems like there are a bunch of things to look at, and doing so (including smoking time) could run us into the hard freeze.

  • Using restrict: I guess we would indeed want a new macro for this?
  • The following does get non-UTF8 performance for both gcc & clang down to around 400 msec task-clock without needing any specific build options. However, I'm unclear on the portability, on performance on platforms where unaligned access is expensive, and on whether this could be slower when/if compilers vectorize with AVX-512 instructions.
rich@perlporting:~/github/perl5$ git diff 1fbc4c4a1693bc080338fac9b92f1a59d97df499 9c8f640ab644a800ce258d669dbaf6399060dda1
diff --git a/pp.c b/pp.c
index 4f0513cc99..3795dd5ab0 100644
--- a/pp.c
+++ b/pp.c
@@ -6526,6 +6526,35 @@ PP(pp_unshift)
     return NORMAL;
 }
 
+/* Swap the byte order of a 64-bit integer */
+/* https://dev.to/wunk/fast-array-reversal-with-simd-j3p */
+static inline uint64_t S_swap64(uint64_t x)
+{
+    return ((x & 0x00000000000000FFULL) << 56) |
+           ((x & 0x000000000000FF00ULL) << 40) |
+           ((x & 0x0000000000FF0000ULL) << 24) |
+           ((x & 0x00000000FF000000ULL) <<  8) |
+           ((x & 0x000000FF00000000ULL) >>  8) |
+           ((x & 0x0000FF0000000000ULL) >> 24) |
+           ((x & 0x00FF000000000000ULL) >> 40) |
+           ((x & 0xFF00000000000000ULL) >> 56);
+}
+
 PP_wrapped(pp_reverse, 0, 1)
 {
@@ -6680,10 +6709,38 @@ PP_wrapped(pp_reverse, 0, 1)
                         }
                     }
                 } else {
+                    STRLEN i = 0;
+                    STRLEN j = len;
                     char * outp= SvPVX(TARG);
-                    const char *p = src + len;
-                    while (p != src)
-                        *outp++ = *--p;
+
+                    /* Where possible, operate on many bytes at once.
+                     * This should have generally good performance, but
+                     * also stands a good chance of being optimised into
+                     * bswap instructions by optimising c compilers. */
+                    while (j - i >= 16) {
+                        /* Reverse copy 8 bytes from the front of src to the back of outp */
+                        *(uint64_t *)(outp + i)
+                                = S_swap64( *(uint64_t *)(src + j - 8) );
+                        /* Do the same from the back of src to the front of outp */
+                        *(uint64_t *)(outp + j - 8)
+                                = S_swap64( *(uint64_t *)(src + i) );
+                        i += 8;
+                        j -= 8;
+                    }
+
+                    /* We could also reverse copy 4 bytes, then 2 bytes
+                     * at a time. But those loops will execute only a
+                     * few times at most. Is it worth the bloat? */
+
+                    /* On 32-bit platforms, should we use S_swap32
+                     * instead of S_swap64? */
+
+                    /* Swap any remaining bytes one by one. */
+                    while (i < j) {
+                        outp[i] = src[j - 1];
+                        outp[j - 1] = src[i];
+                        i++; j--;
+                    }
                 }
                 RETURN;
             }
  • Should/could we support both of the above, maybe using #ifdefs based off build flags, or even doing something more adventurous?

@tonycoz (Contributor)

  • performance on platforms where unaligned access is expensive

Unaligned access is undefined behaviour in C; UBSAN will complain about it if nothing else.

Under the hood the compiler can generate code that relies on the platform or special instructions that support unaligned access, just as it does when clang/gcc vectorizes the simple loop.
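One common way to avoid that undefined behaviour in the source itself (a sketch, not something proposed in this PR) is to do the 8-byte loads and stores through memcpy; compilers typically lower a small fixed-size memcpy to a single, possibly unaligned, load or store on platforms that permit it:

    #include <stdint.h>
    #include <string.h>

    /* S_swap64 as in the diff quoted above. */
    static inline uint64_t S_swap64(uint64_t x)
    {
        return ((x & 0x00000000000000FFULL) << 56) | ((x & 0x000000000000FF00ULL) << 40) |
               ((x & 0x0000000000FF0000ULL) << 24) | ((x & 0x00000000FF000000ULL) <<  8) |
               ((x & 0x000000FF00000000ULL) >>  8) | ((x & 0x0000FF0000000000ULL) >> 24) |
               ((x & 0x00FF000000000000ULL) >> 40) | ((x & 0xFF00000000000000ULL) >> 56);
    }

    /* Byte-swapping load/store helpers that make no alignment assumptions. */
    static inline uint64_t load_swapped64(const char *p)
    {
        uint64_t x;
        memcpy(&x, p, sizeof x);   /* safe for any alignment of p */
        return S_swap64(x);
    }

    static inline void store64(char *p, uint64_t x)
    {
        memcpy(p, &x, sizeof x);
    }

The bulk loop in the diff could then use store64(outp + i, load_swapped64(src + j - 8)) and the like, rather than casting through uint64_t pointers.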


pp.c (Outdated)

    if (DO_UTF8(src_sv)) {
        SvUTF8_on(TARG);

        const U8* s = (U8*)src;
@tonycoz (Contributor)

This would be better as:

const U8* s = (const U8*)src;

which avoids casting away const.

Similarly below for send.

@richardleach (Contributor, Author)

Thanks, I'll make those two changes.

@tonycoz (Contributor)

The main thing I'm hesitant about with this change is hoisting the SvPV_can_swipe_buf() out into a macro, and one where three of the parameters are simple fields of one of the other parameters.

I think it could be a simple function in sv.c taking only the ssv and flags parameters.

If this is defined before Perl_sv_setsv_flags it may even get inlined there.

(other opinions welcome here)

@richardleach (Contributor, Author)

I think it could be a simple function in sv.c taking only the ssv and flags parameters.

Without specifying any explicit optimization settings, gcc does not inline it, but clang does. Feeding -O3 to gcc doesn't change the result.

I suppose the check could be pulled out into a macro in sv.c, then used by both Perl_sv_setsv_flags and a new Perl_sv_can_swipe_buf function?

@tonycoz (Contributor)

I suppose the check could be pulled out into a macro in sv.c, then used by both Perl_sv_setsv_flags and a new Perl_sv_can_swipe_buf function?

Works for me
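A rough sketch of the arrangement being agreed here; the condition and signature below are illustrative stand-ins, not the actual sv.c logic:

    /* Illustrative only: one macro holds the "can we swipe the buffer?" test,
     * used both inline in the hot Perl_sv_setsv_flags path and by a small
     * standalone function that pp_reverse can call. The condition shown is a
     * simplified placeholder; the real test has more cases (COW state, buffer
     * length checks, etc.). */
    #define S_SvPV_can_swipe_buf(ssv, sflags)            \
        (   !((sflags) & SV_NOSTEAL)                     \
         && SvTEMP(ssv)                                  \
         && SvREFCNT(ssv) == 1                           \
         && !SvROK(ssv))

    bool
    Perl_sv_can_swipe_pv_buf(pTHX_ SV *ssv, U32 sflags)
    {
        return cBOOL(S_SvPV_can_swipe_buf(ssv, sflags));
    }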

…_flags

Perl_sv_setsv_flags contains the canonical logic for determining the best
method for assigning the string value from a source SV to a destination SV:

  * "Swipe" the string buffer from the source SV
  * COW the source SV's string buffer
  * Do a full copy

This commit extracts the "can we swipe the buffer" tests out into a new
macro (`S_SvPV_can_swipe_buf`) within sv.c. It has two users:

  * Perl_sv_setsv_flags - so that the logic remains inline in this hot code
  * Perl_sv_can_swipe_pv_buf - a new function

`pp_reverse` will shortly make use of the new function to avoid unnecessary
string copies when doing a reversal in scalar context.
This could help detect implementation bugs in algorithms for reverse copying from a source buffer to a different destination buffer.
`pp_reverse()` has traditionally operated on a single string buffer,
reversing it in place.

When `pp_reverse()` takes a single SV argument, whose buffer cannot be
swiped by `sv_setsv_flags()`, a straight copy of the buffer is taken
and then that is reversed. In such cases, two complete passes over the
entire string are needed to reverse a non-UTF8 string, and reversing a
UTF8 string takes three complete passes.

This commit enables `pp_reverse()` to apply the "swipe" test itself
and, for straightforward copy cases, to avoid calling `sv_setsv_flags()`.
Instead, it does the work of preparing the `TARG` SV and reverse copies
the string (UTF8 or otherwise) in a single pass.

The performance improvement will vary by compiler and CPU. With gcc 12.2.0
on Linux running on "znver3" hardware, performance for both UTF8 and
non-UTF8 strings approximately doubled. Using clang-11 instead on the same
machine gave a fivefold improvement for the non-UTF8 case.
@richardleach (Contributor, Author)

I suppose the check could be pulled out into a macro in sv.c, then used by both Perl_sv_setsv_flags and a new Perl_sv_can_swipe_buf function?

Works for me

Now implemented like that.

richardleach merged commit e792078 into Perl:blead on Mar 12, 2025.
richardleach deleted the hydahy/reverse_lesscopy branch on March 12, 2025 at 09:25.
Reviewers: tonycoz (approved these changes)
Assignees: no one assigned
Labels: none yet
Projects: none yet
Milestone: no milestone
Participants (4): @richardleach, @thesamesam, @tonycoz, @xenu
