Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork32.4k
gh-127971: fix off-by-one read beyond the end of a string during search#132574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Fix off-by-one read beyond the end of a string.
Objects/stringlib/fastsearch.h Outdated
@@ -595,7 +595,7 @@ STRINGLIB(default_find)(const STRINGLIB_CHAR* s, Py_ssize_t n, | |||
continue; | |||
} | |||
/* miss: check if next character is part of pattern */ | |||
if (!STRINGLIB_BLOOM(mask, ss[i+1])) { | |||
if (i < w &&!STRINGLIB_BLOOM(mask, ss[i+1])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'm totally not sure thatelse
branch is appropiate here for 'i == w2.
Even since we havei > w
for thism
this logic isn't clear for me.
What about simple replacementi <= w
for loop condition with classicali < w
?
It could be enough and cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I think the else branch will be OK, since all it does is advance the index, and it is intended and expected that this could potentially advance it past the end of the string, in which case thefor
loop will terminate.
We can't just replace the the loop condition withi < w
since thew
here is the lastvalid index that the pattern could appear at, and needs to be checked. Otherwise we would miss valid matches, and indeed such a change breaks a large number of unit tests.
Note the conditionals that were changed aremiss conditions, i.e. the algorithm has determined the character at the index cannot be part of the pattern at this location in the string. The conditionals modified are checking whether thefollowing character could potentially be part of a pattern hit, so as to determine whether to skip it entirely by advancing the full length of the pattern or only as much as possible while still considering it as a valid potential hit. In the case where we are at the end of the buffer it doesn't actually matter which branch we take, since either way it will advance past it and terminate. We just need to avoid reading the invalid following character when it doesn't exist.
efimov-mikhailJun 21, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Maybe it's better to rewrite condition asi+1 <= w
? It seems to be more obvious way of checking for me.
It's very similar to "for" condition, but for another argument and before direct using ofi+1
as index.
... the
w
here is the lastvalid index that the pattern could appear at
IMO, rewrited condition slightly reduces cognitive load.
efimov-mikhail commentedJun 19, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I'm not at all an expert in this code, so I've tried to apply your patch by myself. But I suggest some changes. |
I have been hesitant to ping specific developers, as I don't want to annoy people or violate etiquette, but if you think it would be appropriate there are a couple of devs who have worked on this code frequently and/or recently I could tag in?
That is an excellent find, thanks! Agreed: I'll investigate, try to get a test that reproduces the issue there also, and fix it. |
I've just used |
I can trigger the off-by-one reads in the adaptive algorithm easily enough, but I haven't been able to get it to cause an ASAN error as yet, so no luck so far on a test for that. The third variant, /* miss: check if previous character is part of pattern */if (i>0&& !STRINGLIB_BLOOM(mask,s[i-1])) {i=i-m; }else {i=i-skip; } }else {/* skip: check if previous character is part of pattern */if (i>0&& !STRINGLIB_BLOOM(mask,s[i-1])) {i=i-m; } } } |
Does IMO, we can add check and tests for |
Looking closer at Presumably the extra clause in the conditional has some measurable performance impact, however slight, and given the input string minimum length requirements it will likely add up to much more than would be saved by avoiding one unnecessary character read. So, perhaps we shouldn't add it to Nonetheless, I'll push out an update to this that adds an |
…asons forand limitations of the tests.
efimov-mikhail commentedJun 21, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I suggest to slightly rewrite But those are only cosmetical changes, in other aspects LGTM. |
…mmentexplaining why a guard is not necessary in adaptive_find.
Sure thing, will update, thanks! |
efimov-mikhail left a comment• edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM. I'd like to have just little rewording in the comment.
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I am not fond of tests thatmay trigger a crash and the comments explaining why we have them are too long IMO. I don't see why we can't find an explicit input that triggers the code paths we want (except that it may be hard to craft such input). If it's too hard, we can just.. remove the tests. Or we should indicate in the C code that tests must be updated if we change the implementation.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Misc/NEWS.d/next/Core_and_Builtins/2025-04-16-12-01-13.gh-issue-127971.pMDOQ0.rst OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
I'm not fond of such tests either. I was unable to craft a test case that triggers an ASAN warning, so I will gladly drop it. |
comments, drop extremely marginal "test"
Uh oh!
There was an error while loading.Please reload this page.
Ok, just remove the ASAN guard (my bad). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
After removing the ASAN guard, I'll merge. Note thatdefault_rfind
actually has the checks fori > 0
, so it does make sense to keep them forfind
as well.
OK, will remove the ASAN guard and always run the test. I agree, that seems easiest. Thanks very much for the reviews, both of you, I appreciate it! |
85ec3b3
intopython:mainUh oh!
There was an error while loading.Please reload this page.
Sorry,@duaneg and@picnixz, I could not cleanly backport this to
|
…g search (pythonGH-132574)(cherry picked from commit85ec3b3)Co-authored-by: Duane Griffin <duaneg@dghda.com>
Can you take care of the failed backport? TiA. |
GH-136628 is a backport of this pull request to the3.14 branch. |
IMO, it'll be better to wait few days for@duaneg's response. I also can make such backport. |
With pleasure! It's just so that I can close the issue otherwise I'll likely forget about it :) |
Sorry, is there anything else to do? It looks like the backport PR has now been merged (thanks!) but if there is anything more required, please let me know and I'll take care of it. Apologies for not doing it myself immediately last night, but it was 1am local time and I've not done a (CPython) backport before, so while it all looks quite straight-forward I wanted to wait until morning when I had time and was fully awake 🙂 |
efimov-mikhail commentedJul 14, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Yes, there is some work to be done. Also, this link maybe useful:https://devguide.python.org/contrib/core-team/committing/#backporting-changes-to-an-older-version |
GH-136645 is a backport of this pull request to the3.13 branch. |
GH-136648 is a backport of this pull request to the3.13 branch. |
This first one I made a mistake and tried to merge it into main, sorry. Hopefully the second attempt worked correctly! |
Uh oh!
There was an error while loading.Please reload this page.
Fix cases where the string search algorithm reads one past the end of the character/byte array under certain conditions.