Issue 1647489

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: zero-length match confuses re.finditer()
Type: behavior
Stage: resolved
Components: Library (Lib), Regular Expressions
Versions: Python 3.7, Python 3.6, Python 2.7

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: THRlWiTi, denversc, ezio.melotti, isoschiz, jfrechet, mrabarnett, niemeyer, rsc, serhiy.storchaka, timehorse
Priority: normal
Keywords: patch

Created on 2007-01-29 22:35 by jfrechet, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Pull Requests
URL      Status  Linked
PR 4471  merged  serhiy.storchaka, 2017-11-19 23:36
PR 4678  closed  serhiy.storchaka, 2017-12-02 17:32
Messages (15)
msg31129 - Author: Jacques Frechet (jfrechet) Date: 2007-01-29 22:35
Hi!

re.finditer() seems to incorrectly increment the current position immediately after matching a zero-length substring.  For example:

>>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'bc')]

What happened to the 'a'?  I expected this result:

[('', None), (None, 'abc')]

Perl agrees with me:

% perl -le 'print defined($1)?"\"$1\"":"undef",",",defined($2)?"\"$2\"":"undef" while "abc" =~ /(z*)|(\w+)/g'
"",undef
undef,"abc"
"",undef

Similarly, if I remove the ^:

>>> [m.groups() for m in re.finditer(r'(z*)|(\w+)', 'abc')]
[('', None), ('', None), ('', None), ('', None)]

Now all of the letters have fallen through the cracks!  I expected this result:

[('', None), (None, 'abc'), ('', None)]

Again, Perl agrees:

% perl -le 'print defined($1)?"\"$1\"":"undef",",",defined($2)?"\"$2\"":"undef" while "abc" =~ /(z*)|(\w+)/g'
"",undef
undef,"abc"
"",undef

If this bug has already been reported, I apologize -- I wasn't able to find it here.  I haven't looked at the code for the re module, but this seems like the sort of bug that might have been accidentally introduced in order to try to prevent the same zero-length match from being returned forever.

Thanks,
Jacques
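The two cases in this report can be re-run as a short script. This is a minimal sketch that assumes Python 3.7 or later, that is, an interpreter that already contains the change recorded at the end of this thread; the variable names are illustrative only. On such versions the results should match what the reporter expected, while older interpreters give the outputs quoted in the report.

```python
import re

# Patterns and input string are taken from the report above.
# Assumption: Python 3.7+, where finditer() can return a non-empty
# match that starts right after a previous empty match.
anchored = [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
unanchored = [m.groups() for m in re.finditer(r'(z*)|(\w+)', 'abc')]

print(anchored)    # reporter expected [('', None), (None, 'abc')]
print(unanchored)  # reporter expected [('', None), (None, 'abc'), ('', None)]
```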
msg73719 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-09-24 15:37
This also affects re.findall().
msg73737 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-09-24 17:59
What should:

    [m.groups() for m in re.finditer(r'(^z*)|(^q*)|(\w+)', 'abc')]

return? Should the second group also yield a zero-width match before the third group is tried? I think it probably should. Does Perl?
msg73741 - Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-24 18:14
Hmmm.  This strikes me as a bug, beyond the realm of Issue 3262.  The two items may be related, but the dropping of the 'a' seems like unexpected behaviour that I doubt any current code is expecting to occur.  Clearly, what is going on is that the Engine starts scanning at the 'a', finds the Zero-Width match and, having found a match, increments its pointer within the input string, thus skipping the 'a' when it matches 'bc'.

If it is indeed a bug, I think this should be considered for inclusion in Python 2.6 rather than being part of the new Engine Design in Issue 3626.  I think the solution would simply be to not increment the ptr (which points to the input string) when findall / finditer encounters a Zero-Width match.
msg73742 - Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-24 18:21
Never mind inclusion in 2.6 as no-one has repeated this bug in re-world examples yet so it's going to have to wait for the Regexp 2.7 engine in issue 2636.
msg73746 - Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-24 18:33
Ah, I see the problem: if ptr is not incremented, then it will keep matching the first expression, (^z*), so it would have to both 'skip' the 'a' and NOT skip the 'a'.  Hmm.  You're right, Matthew, this is pretty complicated.  Now, for your expression, Matthew, r'(z*)|(^q*)|(\w+)', Perl gives:

"",undef,undef
undef,undef,"abc"
"",undef,undef

Meaning it doesn't even bother matching the ^q* since the ^z* matches first.  This seems the logical behaviour and fits with the idea that a Zero-Width match would both only match once and NOT consume any characters.  An internal flag would just have to be created to tell the 2 find functions whether the current value of ptr would allow for a "No Zero-Width Match" option on second go-around.
msg73755 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-09-24 21:14
What about r'(^z*)|(q*)|(\w+)'? I could imagine that the first group could match only at the start of the string, but if the second group doesn't have that restriction then it could match the second time, and only after that could the third match, if you see what I mean. (The previous example had (^q*) so it couldn't match because the first group has already matched at the start of the string and we've already advanced beyond that, even though by no characters!)
msg73765 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-09-24 23:34
FYI, I posted msg73737 after finding that the fix for the original case was really very simple, but then thought about whether it would behave as expected when there were more zero-width matches, hence the later posts.
msg73789 - Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-25 13:08
Perl gives this result for your new expression:

"",undef,undef
undef,undef,"abc"
undef,"",undef

I think it has to do with not thinking of a string as a sequence of characters, but as a sequence of characters separated by null-space.  Null-space can be captured, but ONLY if it is part of a zero-width match, and once captured, it can no longer be captured by another zero-width expression.  This is in keeping with what I see as Perl's behaviour, namely that the (q*) group never participates in the first match because, initially, the (^z*) captures it.  OTOH, when it gets to the null-space AFTER the 'abc' capture, the (^z*) cannot participate because it has an "at-beginning" restriction.  The evaluator then moves on to the (q*), which has no such restriction and this time it matches, consuming the final null-space.
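The "null-space" model described above can be made visible by printing where each match sits and which group produced it. This is a rough sketch, assuming Python 3.7 or later (where finditer() behaves as in the Perl output quoted above); the helper name trace is hypothetical and not part of the re module.

```python
import re

def trace(pattern, text):
    """Return (span, index of the group that matched) for each match."""
    # m.span() gives the (start, end) offsets of the whole match;
    # m.lastindex is the index of the last capturing group that matched.
    return [(m.span(), m.lastindex) for m in re.finditer(pattern, text)]

steps = trace(r'(^z*)|(q*)|(\w+)', 'abc')
for span, group in steps:
    print(span, group)
```

Under that assumption, the empty span at offset 0 is claimed once by group 1, the text 'abc' is then matched by group 3 starting at the same offset, and the empty span at offset 3 goes to group 2, since (^z*) cannot match away from the start of the string.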
msg73792 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-09-25 13:55
I have to report that the fix appears to be successful:

>>> print [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'abc')]
>>> print re.findall(r"(^z*)|(\w+)", "abc")
[('', ''), ('', 'abc')]
>>> print [m.groups() for m in re.finditer(r"(^z*)|(q*)|(\w+)", "abc")]
[('', None, None), (None, None, 'abc'), (None, '', None)]
>>> print re.findall(r"(^z*)|(q*)|(\w+)", "abc")
[('', '', ''), ('', '', 'abc'), ('', '', '')]

The patch is regex_2.6rc2+7.diff.
msg73809 - Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-25 19:22
Matthew, I'll try to merge all your diffs with the current repository over the weekend.  Having done the first, I know where code differs between your implementation, mine and the base, so I can apply your patch, and then a patch that restores my changes so the rest of the merges should be easy!  :)
msg132827 - Author: Denver Coneybeare (denversc) * Date: 2011-04-03 01:59
I just re-tested this issue in trunk at changeset 053bc5ca199b and the issue is still exactly reproducible as originally reported.  That is, the match to the empty string skips a character of the match:

>>> import re
>>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'bc')]
msg187318 - Author: Martin Morrison (isoschiz) * Date: 2013-04-19 00:21
This is still an issue today:

>>> import re
>>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'bc')]
msg221979 - Author: Mark Lawrence (BreamoreBoy) * Date: 2014-06-30 20:15
How does "the Regexp 2.7 engine in issue 2636" from msg73742 deal with this situation?
msg307556 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-12-04 12:29
New changeset 70d56fb52582d9d3f7c00860d6e90570c6259371 by Serhiy Storchaka in branch 'master':
bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. (#4471)
https://github.com/python/cpython/commit/70d56fb52582d9d3f7c00860d6e90570c6259371
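The changeset title mentions splitting on zero-width patterns. As a hedged illustration, assuming Python 3.7 or later, the sketch below runs two examples that also appear in the standard re module documentation; the variable names are illustrative, and the outputs noted in the comments are the ones the documentation gives for those versions.

```python
import re

# Assumption: Python 3.7+. Earlier versions did not split on
# patterns that can only produce empty matches, such as \b.
parts = re.split(r'\b', 'Words, words, words.')
print(parts)     # ['', 'Words', ', ', 'words', ', ', 'words', '.']

# An empty match may now sit directly next to a previous non-empty match,
# so the position right after 'x' is also replaced.
replaced = re.sub('x*', '-', 'abxd')
print(replaced)  # -a-b--d-
```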
History
Date                 User              Action  Args
2022-04-11 14:56:22  admin             set     github: 44519
2018-03-14 17:17:12  serhiy.storchaka  set     status: open -> closed; resolution: fixed; stage: patch review -> resolved
2017-12-04 12:29:09  serhiy.storchaka  set     messages: +msg307556
2017-12-02 17:32:36  serhiy.storchaka  set     pull_requests: +pull_request4587
2017-11-19 23:36:58  serhiy.storchaka  set     keywords: +patch; stage: patch review; pull_requests: +pull_request4404
2017-11-18 15:55:13  serhiy.storchaka  set     nosy: +ezio.melotti; type: behavior; components: + Library (Lib)
2017-11-18 15:55:00  serhiy.storchaka  set     assignee: niemeyer -> serhiy.storchaka; nosy: +serhiy.storchaka; versions: + Python 3.6, Python 3.7
2016-11-05 17:51:35  BreamoreBoy       set     nosy: -BreamoreBoy
2016-11-05 14:23:33  THRlWiTi          set     nosy: +THRlWiTi
2014-06-30 20:15:38  BreamoreBoy       set     nosy: +BreamoreBoy; messages: +msg221979
2013-04-19 00:21:29  isoschiz          set     nosy: +isoschiz; messages: +msg187318
2011-04-03 01:59:19  denversc          set     nosy: +denversc; messages: +msg132827
2008-09-25 19:22:06  timehorse         set     messages: +msg73809
2008-09-25 13:55:38  mrabarnett        set     messages: +msg73792
2008-09-25 13:08:16  timehorse         set     messages: +msg73789
2008-09-24 23:34:19  mrabarnett        set     messages: +msg73765
2008-09-24 21:14:13  mrabarnett        set     messages: +msg73755
2008-09-24 18:33:08  timehorse         set     messages: +msg73746
2008-09-24 18:21:05  timehorse         set     messages: +msg73742; versions: + Python 2.7, - Python 2.5
2008-09-24 18:14:27  timehorse         set     messages: +msg73741
2008-09-24 17:59:56  mrabarnett        set     messages: +msg73737
2008-09-24 17:35:57  timehorse         set     nosy: +timehorse
2008-09-24 15:37:46  mrabarnett        set     nosy: +mrabarnett; messages: +msg73719
2008-04-24 21:08:30  rsc               set     nosy: +rsc
2007-01-29 22:35:21  jfrechet          create