Issue 30349

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Preparation for advanced set syntax in regular expressions
Type: enhancement
Stage: resolved
Components: Library (Lib), Regular Expressions
Versions: Python 3.7

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: Tim.Graham, ezio.melotti, mrabarnett, pombredanne, r.david.murray, rhettinger, serhiy.storchaka
Priority: normal
Keywords:

Created on 2017-05-12 08:14 by serhiy.storchaka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL      Status  Linked
PR 1553  merged  serhiy.storchaka, 2017-05-12 08:20
Messages (8)
msg293532 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-05-12 08:14
Currently the re module supports only simple sets. They can include literal characters, character ranges, and some simple character classes, and they support negation. The Unicode standard [1] defines set operations (union, intersection, difference and symmetric difference) and nested sets. Some regular expression engines implement these features; for example, the regex module supports all TR18 features except not-nested POSIX character classes.

If the re module is replaced with the regex module, or if support for these features is added to the re module, and this syntax is enabled by default, some code will break. It is very unlikely that a regular expression contains doubled characters ('--', '||', '&&' or '~~'), but nested sets use just '[', and an unescaped '[' does occur in character sets in regular expressions (even the stdlib contains several occurrences).

The proposed patch adds a FutureWarning that is emitted when a possibly breaking set construct ('--', '||', '&&', '~~' or '[') occurs in a regular expression. We need one or two releases with a warning before changing the syntax. The patch also makes re.escape() escape '&' and '~' and fixes several regular expressions in the stdlib.

Alternatively, support for the new set syntax could be enabled by a special flag.

I'm not sure that support for set operations and nested sets is necessary. It complicates the syntax of regular expressions (which is already not simple). Currently set operations can be emulated with lookarounds:

[set1||set2] -- (?:[set1]|[set2])
[set1&&set2] -- [set1](?<=[set2]) or (?=[set1])[set2]
[set1--set2] -- [set1](?<![set2]) or [set1](?<=[^set2]) or (?=[set1])[^set2]
[set1~~set2] -- recursively expand [[set1||set2]--[set1&&set2]]

[1] http://unicode.org/reports/tr18/#Subtraction_and_Intersection
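The lookaround emulations listed in this message can be checked with a short script. The following is only a sketch: the two sets (a-f and d-z) and the helper name members are made up for illustration, and the symmetric difference is written with a negative lookahead as one possible expansion of "(set1 || set2) -- (set1 && set2)".

```python
import re
import string

# Hypothetical example sets, chosen only for illustration.
SET1 = "a-f"
SET2 = "d-z"

# Emulations of the TR18 set operations using only syntax that the re
# module already supports.
EMULATED = {
    "union": rf"(?:[{SET1}]|[{SET2}])",
    "intersection": rf"(?=[{SET1}])[{SET2}]",
    "difference": rf"(?=[{SET1}])[^{SET2}]",
    # One possible expansion of (set1 || set2) -- (set1 && set2):
    # reject the intersection with a negative lookahead, then match the union.
    "symmetric difference": rf"(?!(?=[{SET1}])[{SET2}])(?:[{SET1}]|[{SET2}])",
}


def members(pattern, alphabet=string.ascii_lowercase):
    """Return the characters of `alphabet` that the pattern matches in full."""
    compiled = re.compile(pattern)
    return "".join(ch for ch in alphabet if compiled.fullmatch(ch))


for name, pattern in EMULATED.items():
    print(f"{name:>20}: {members(pattern)}")
```

The lookbehind spellings from the list, such as [set1](?<=[set2]) and [set1](?<![set2]), can be passed to the same helper and should select the same characters as the lookahead forms.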
msg303757 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-10-05 10:26
Made the warning for '[' be emitted only at the start of a set. This significantly decreases the breakage of other code. I think we can get by without implicit union of nested sets, as in [_[0-9][:Latin:]]. This can be written as [_||[0-9]||[:Latin:]].
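To see the effect of restricting the '[' warning to the start of a set, one can record the warnings raised while compiling two small patterns. This is a sketch that assumes Python 3.7 or later; the helper future_warnings is a hypothetical name, and re.purge() is called only so that the pattern cache does not hide a warning issued by an earlier compile.

```python
import re
import warnings


def future_warnings(pattern):
    """Compile `pattern` and return the FutureWarning messages it triggers."""
    re.purge()  # a cached pattern would not be parsed (and warned about) again
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        re.compile(pattern)
    return [str(w.message) for w in caught
            if issubclass(w.category, FutureWarning)]


print(future_warnings(r"[[a]"))  # '[' directly after the opening bracket
print(future_warnings(r"[a[]"))  # '[' later in the set
```

Both patterns describe the same set of two characters; only the first places '[' where a nested set would begin under the new syntax.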
msg306349 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-11-16 10:38
New changeset 05cb728d68a278d11466f9a6c8258d914135c96c by Serhiy Storchaka in branch 'master':
bpo-30349: Raise FutureWarning for nested sets and set operations (#1553)
https://github.com/python/cpython/commit/05cb728d68a278d11466f9a6c8258d914135c96c
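A rough way to observe what this changeset does for set operators, assuming Python 3.7 or later: compile sets containing a doubled '&', '|' or '~' and record the warnings, then check that a set built from re.escape() output stays silent. The helper future_warnings is a hypothetical name. The '--' case is left out of the sketch because '-' also forms ranges inside a set, which makes a minimal example less clear.

```python
import re
import warnings


def future_warnings(pattern):
    """Compile `pattern` and return the FutureWarning messages it triggers."""
    re.purge()  # avoid the pattern cache so the parser runs every time
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        re.compile(pattern)
    return [str(w.message) for w in caught
            if issubclass(w.category, FutureWarning)]


# Doubled operator characters inside a set are reported as possible
# set operations.
for op in ("&&", "||", "~~"):
    print(op, future_warnings(f"[a{op}b]"))

# re.escape() escapes '&' and '~', so text passed through it can be placed
# inside a set without triggering the warning.
escaped = re.escape("a&&b~~c")
print(escaped, future_warnings(f"[{escaped}]"))
```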
msg311682 - Author: Tim Graham (Tim.Graham) Date: 2018-02-05 19:23
It might be worth adding part of the problematic regex to the warning message. For Django's tests, I see an error like "FutureWarning: Possible nested set at position 17 return re.compile(res).match". It took some effort to track down the source.

A partial traceback is:

  File "/home/tim/code/django/django/core/management/commands/loaddata.py", line 247, in find_fixtures
    for candidate in glob.iglob(glob.escape(path) + '*'):
  File "/home/tim/code/cpython/Lib/glob.py", line 72, in _iglob
    for name in glob_in_dir(dirname, basename, dironly):
  File "/home/tim/code/cpython/Lib/glob.py", line 83, in _glob1
    return fnmatch.filter(names, pattern)
  File "/home/tim/code/cpython/Lib/fnmatch.py", line 52, in filter
    match = _compile_pattern(pat)
  File "/home/tim/code/cpython/Lib/fnmatch.py", line 46, in _compile_pattern
    return re.compile(res).match
  File "/home/tim/code/cpython/Lib/re.py", line 240, in compile
    return _compile(pattern, flags)
  File "/home/tim/code/cpython/Lib/re.py", line 292, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/home/tim/code/cpython/Lib/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 524, in _parse
    FutureWarning, stacklevel=nested + 6
FutureWarning: Possible nested set at position 17

As an aside, I'm not sure how to fix the warning in Django. It comes from the test added in https://github.com/django/django/commit/98df288ddaba9787e4a370f12aba51c2b9133142 where a path like 'tests/fixtures/fixtures/fixture_with[special]chars' is run through glob.escape() which creates 'tests/fixtures/fixtures/fixture_with[[]special]chars'.
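The origin of the "[[" sequence can be reproduced without Django. The sketch below uses only the file name from the path quoted above and shows that glob.escape() wraps '[' in a one-character set, which is exactly the spelling that looks like the start of a nested set. Whether compiling the translated pattern warns depends on the Python version, so the sketch does not check for a warning.

```python
import fnmatch
import glob

# File name taken from the path quoted in the message above.
name = "fixture_with[special]chars"

# glob.escape() protects '[' by wrapping it in a set: '[' becomes '[[]'.
escaped = glob.escape(name)
print(escaped)

# The escaped pattern still matches the literal file name.
print(fnmatch.fnmatchcase(name, escaped))
```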
msg311684 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2018-02-05 19:43
Good catch! fnmatch.translate() can produce a pattern which emits a warning when compiled. Could you please open a separate issue for this?
msg311688 - Author: Tim Graham (Tim.Graham) Date: 2018-02-05 20:08
Okay, I created #32775.
msg402299 - Author: Philippe Ombredanne (pombredanne) Date: 2021-09-21 09:23
FWIW, this warning is annoying because it is hard to fix in the case where the regexes are sourced from data: the warning message does not include the regex at fault. It should; otherwise the warning is noisy and ineffective IMHO.
msg402303 - Author: Philippe Ombredanne (pombredanne) Date: 2021-09-21 09:51
Sorry, my comment was at best nonsensical gibberish!

I meant to say that this warning message should include the actual regex at fault. Otherwise it is hard to fix when the regex in question comes from some data structure like a list: the line number where the warning occurs is not enough to fix the issue, and the code needs to be instrumented first to catch the warning, which is rather heavy-handed for handling a warning.
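The instrumentation described in this message might look roughly like the wrapper below. It is a sketch only, not part of the re module: compile_with_context is a hypothetical helper that records any warning raised while compiling and re-issues it with the offending pattern appended. It also assumes the pattern is not already in re's cache, since a cached pattern is not parsed again and so triggers no warning.

```python
import re
import warnings


def compile_with_context(pattern, flags=0):
    """Compile `pattern`, re-issuing any warning with the pattern included.

    Hypothetical helper for illustration; not provided by the re module.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compiled = re.compile(pattern, flags)
    for w in caught:
        # Re-emit with the same category, attributed to our caller.
        warnings.warn(f"{w.message} (pattern: {pattern!r})",
                      w.category, stacklevel=2)
    return compiled


# Patterns that come from data, as in the scenario described above.
patterns_from_data = ["[a-z]+", "[[]x]", r"\d+"]
compiled = [compile_with_context(p) for p in patterns_from_data]
```

With this wrapper the warning text names the pattern, so a pattern held in a list can be identified without relying on the line number alone.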
History
Date                 User              Action  Args
2022-04-11 14:58:46  admin             set     github: 74534
2021-09-21 09:51:26  pombredanne       set     messages: +msg402303
2021-09-21 09:23:52  pombredanne       set     nosy: +pombredanne; messages: +msg402299
2018-02-05 20:08:17  Tim.Graham        set     messages: +msg311688
2018-02-05 19:43:06  serhiy.storchaka  set     messages: +msg311684
2018-02-05 19:23:15  Tim.Graham        set     nosy: +Tim.Graham; messages: +msg311682
2017-11-16 10:39:01  serhiy.storchaka  set     status: open -> closed; resolution: fixed; stage: patch review -> resolved
2017-11-16 10:38:33  serhiy.storchaka  set     messages: +msg306349
2017-10-05 10:26:52  serhiy.storchaka  set     messages: +msg303757
2017-05-12 08:20:39  serhiy.storchaka  set     pull_requests: +pull_request1650
2017-05-12 08:14:13  serhiy.storchaka  create