- Notifications
You must be signed in to change notification settings - Fork4k
GH-43960: [R] fixstr_sub binding to properly handle negativeend values#44141
GH-43960: [R] fixstr_sub binding to properly handle negativeend values#44141thisisnic merged 2 commits intoapache:mainfrom
str_sub binding to properly handle negativeend values#44141Conversation
thisisnic commentedSep 20, 2024
This looks great - thanks for updating this@coussens, great to see you join the Arrow dev community! I've just approved the workflow runs - once this passes, I'm happy to merge. Feel free to ping me as I'm a bit slow checking my notifications these days! |
Uh oh!
There was an error while loading.Please reload this page.
thisisnic left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks so much for updating this@coussens!
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit37f62d0. There were no benchmark performance regressions. 🎉 Thefull Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Uh oh!
There was an error while loading.Please reload this page.
First-time contributor here, so let me know where I can improve!
Rationale for this change
The
str_subbinding in arrow was not handling negativeendvalues properly. The problem was two-fold:endvalues were negative (and less than thestartvalue, which might be positive),str_subwould improperly return an empty string.endvalues were < -1 but theendposition was still to the right of thestartposition,str_subfailed to return the final character in the substring, since it did not account for the fact thatendis counted exclusively in the underlying C++ function (utf8_slice_codeunits), but inclusively in R.See discussion/examples at#43960 for details.
What changes are included in this PR?
r/R/dplyr-funcs-string.Rthat previously setend= 0 whenstart < end, which meant if the user was counting backwards from the end of the string (with a negativeendvalue), an empty string would [wrongly] be returned. It appears that the case that the previous code was trying to address is already handled properly by the underlying C++ function (utf8_slice_codeunits).r/R/dplyr-funcs-string.Rin order to account the difference in between R's inclusiveendand C++'s exclusiveendwhenendis negative.r/tests/testthat/test-dplyr-funcs-string.Rto test for these cases.Are these changes tested?
Yes, I ran all tests in
r/tests/testthat/test-dplyr-funcs-string.R, including one which I added (see attached commit), which explicitly tests the case whereendis negative (-3) and less than thestartvalue (1). This also tests the case whereend< -1 and to the right of thestartposition.Are there any user-facing changes?
No.
This PR contains a "Critical Fix". Previously:
endvalues were negative (and less than thestartvalue, which might be positive),str_subwould improperly return an empty string.endvalues were < -1 but theendposition was still to the right of thestartposition,str_subfailed to return the final character in the substring, since it did not account for the fact thatendis counted exclusively in the underlying C++ function (utf8_slice_codeunits), but inclusively in R.str_sub()silently mishandles negative start/stop values #43960