
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2013-03-07 20:52 by acdha, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| re_ignore_case_range.patch | serhiy.storchaka, 2014-09-08 19:58 | |
| re_ignore_case_range-3.5.patch | serhiy.storchaka, 2014-09-17 08:57 | |
| re_ignore_case_range-3.4_2.patch | serhiy.storchaka, 2014-09-24 19:17 | |
| re_ignore_case_range-3.5_2.patch | serhiy.storchaka, 2014-10-08 19:41 | |
| re_ignore_case_range-3.5_3.patch | serhiy.storchaka, 2014-10-09 07:50 | |
| Messages (17) | |||
|---|---|---|---|
| msg183705 | Author: Chris Adams (acdha) | Date: 2013-03-07 20:52 | |
I noticed an interesting failure while using re.match / re.sub to look for non-Cyrillic characters in allegedly Russian text:
>>> re.sub(r'[\s\u0400-\u0527]+', ' ', 'Архангельская губерния', flags=re.IGNORECASE)
'Архангельская губерния'
>>> re.sub(r'[\s\u0400-\u0527]+', '', 'Архангельская губерния', flags=0)
''
The same is true in Python 2.7, although you need to use ur'' patterns for the literals to be expanded:
>>> re.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=re.IGNORECASE|regex.UNICODE)
u'\u0410\u0440\u0445\u0430\u043d\u0433\u0435\u043b\u044c\u0441\u043a\u0430\u044f\u0433\u0443\u0431\u0435\u0440\u043d\u0438\u044f'
In contrast, the regex module behaves as expected:
>>> regex.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=regex.IGNORECASE|regex.UNICODE)
u''
(Transcript maintained at https://gist.github.com/acdha/5111687)
| msg183712 | Author: Matthew Barnett (mrabarnett) | Date: 2013-03-07 23:19 | |
The way the re module handles ranges is to convert the two endpoints to lowercase and then check whether the lowercase form of the character in the text is in that range.
For example, [A-Z] is converted to the range [\x61-\x7A], and the lowercase form of 'Q' ('\x51') is 'q' ('\x71'), which is in the range.
In your example, [\u0400-\u0527] is converted to the range [\u0450-\u0527], but the lowercase form of 'А' ('\u0410') is 'а' ('\u0430'), which isn't in the range.
This is the same as issue #3511, but a worse failure.
| msg183753 | Author: Chris Adams (acdha) | Date: 2013-03-08 18:22 | |
Ah, that explains it - I'd been hoping, based on the re.DEBUG output, that the explicit Unicode ranges were preserved.
I found #3511 before opening this one but don't believe the decision should be the same, since this isn't a mixed numeric/alphabetic range.
| msg183988 | Author: Ezio Melotti (ezio.melotti) | Date: 2013-03-11 19:50 | |
Matthew, should this be closed then?
| msg183989 | Author: Chris Adams (acdha) | Date: 2013-03-11 19:59 | |
Ezio: given the non-obvious failure, what do you think of at least documenting this and issuing a warning any time both re.UNICODE and re.IGNORECASE are set?
| msg183992 | Author: Matthew Barnett (mrabarnett) | Date: 2013-03-11 21:00 | |
In issue #3511 the range was slightly unusual, so closing it seemed a reasonable approach, but the range in this issue is less clearly a problem. My preference would be to fix it, if possible.
| msg183993 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2013-03-11 21:24 | |
I'm working on the patch.
| msg184016 | Author: Ezio Melotti (ezio.melotti) | Date: 2013-03-12 08:11 | |
Is this the same issue described in #12728?
| msg226608 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2014-09-08 19:58 | |
No, issue #12728 is a more complicated case.
Here is a patch which fixes this issue and issue #3511.
| msg226989 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2014-09-17 08:57 | |
This patch has a disadvantage: it slows down case-insensitive compiling of some very wide ranges, e.g. compile(r"[\x00-\U0010ffff]+", re.I) (this is the worst case). In most cases this is not important, because such wide ranges are rare and compiled patterns are cached.
To get rid of this regression, we need a new opcode. To preserve binary compatibility, this approach can't be applied to old releases. Here is a patch for 3.5.
Please review. These patches are needed to continue fixing other re bugs.
| msg227485 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2014-09-24 19:17 | |
Here is another patch for 3.4. It is more than 10 times faster than the initial patch in the worst case.
| msg228814 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2014-10-08 19:41 | |
Actually, the 3.5 patch can be simpler.
| msg228837 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2014-10-09 07:50 | |
The updated patch for 3.5 addresses Antoine's comments.
Note that 3.4 and 3.5 use different solutions to this issue.
| msg229919 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2014-10-24 12:28 | |
Does the patch look good to you now, Antoine? If there are no objections I'm going to commit it soon.
In order to apply the 3.4 patch to 2.7 we need either to modify the patch significantly, or to first backport the issue #19329 changes to 2.7 (which would be easier).
| msg230332 | Author: Roundup Robot (python-dev) | Date: 2014-10-31 10:42 | |
New changeset 6f52a3d0f548 by Serhiy Storchaka in branch 'default':
Issue #17381: Fixed handling of case-insensitive ranges in regular expressions.
https://hg.python.org/cpython/rev/6f52a3d0f548
New changeset 7981cb1556cf by Serhiy Storchaka in branch '3.4':
Issue #17381: Fixed handling of case-insensitive ranges in regular expressions.
https://hg.python.org/cpython/rev/7981cb1556cf
| msg230336 | Author: Roundup Robot (python-dev) | Date: 2014-10-31 11:55 | |
New changeset ebd48b4f650d by Serhiy Storchaka in branch '2.7':
Backported the optimization of compiling charsets in regular expressions
https://hg.python.org/cpython/rev/ebd48b4f650d
New changeset 6cd4b9827755 by Serhiy Storchaka in branch '2.7':
Issue #17381: Fixed ranges handling in case-insensitive regular expressions.
https://hg.python.org/cpython/rev/6cd4b9827755
| msg230350 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2014-10-31 16:12 | |
Thank you, Antoine, for your review.
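
The range handling that Matthew Barnett describes in msg183712 can be sketched in a few lines of Python. This is a rough illustration, not the actual sre code: `old_range_match` and `variant_range_match` are hypothetical helper names, `str.lower()` and `str.upper()` stand in for the engine's internal case mapping, and the second helper is only one plausible way to get the expected result, not necessarily what the committed patches do. The final check assumes an interpreter that already includes the fix for this issue.

```python
import re


def old_range_match(lo, hi, ch):
    """Approximate the pre-fix check: lowercase both endpoints, then test
    whether the lowercased text character lies between them."""
    return lo.lower() <= ch.lower() <= hi.lower()


def variant_range_match(lo, hi, ch):
    """Illustrative alternative (an assumption, not the committed patch):
    accept ch if it, or a single-character case variant of it, lies in the
    original, unmodified range."""
    variants = {ch, ch.lower(), ch.upper()}
    return any(lo <= v <= hi for v in variants if len(v) == 1)


# [A-Z] becomes [a-z]; 'Q' lowercases to 'q', which is inside it.
print(old_range_match('A', 'Z', 'Q'))                      # True

# [\u0400-\u0527] becomes [\u0450-\u0527]; 'А' (\u0410) lowercases to
# \u0430, which falls below the shifted lower endpoint.
print(old_range_match('\u0400', '\u0527', '\u0410'))       # False

# Checking case variants against the original range accepts it.
print(variant_range_match('\u0400', '\u0527', '\u0410'))   # True

# On an interpreter that includes the fix, the original report's pattern
# removes every character of the sample text.
text = 'Архангельская губерния'
result = re.sub(r'[\s\u0400-\u0527]+', '', text, flags=re.IGNORECASE)
print(repr(result))                                        # ''
```

The failure in the report comes from lowercasing an endpoint that sits in a different block from the lowercase forms of the letters inside the range, so the shifted range no longer covers them.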
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:42 | admin | set | github: 61583 |
| 2014-11-08 12:14:26 | serhiy.storchaka | link | issue3511 superseder |
| 2014-10-31 16:12:16 | serhiy.storchaka | set | status: open -> closed; resolution: fixed; messages: +msg230350; stage: patch review -> resolved |
| 2014-10-31 11:55:20 | python-dev | set | messages: +msg230336 |
| 2014-10-31 10:42:54 | python-dev | set | nosy: +python-dev; messages: +msg230332 |
| 2014-10-24 12:28:34 | serhiy.storchaka | set | messages: +msg229919 |
| 2014-10-09 07:50:41 | serhiy.storchaka | set | files: +re_ignore_case_range-3.5_3.patch; dependencies: +Get rid of SRE character tables; messages: +msg228837 |
| 2014-10-08 19:41:57 | serhiy.storchaka | set | files: +re_ignore_case_range-3.5_2.patch; messages: +msg228814 |
| 2014-09-24 19:17:11 | serhiy.storchaka | set | files: +re_ignore_case_range-3.4_2.patch; messages: +msg227485 |
| 2014-09-21 20:45:06 | serhiy.storchaka | link | issue12728 dependencies |
| 2014-09-17 08:57:13 | serhiy.storchaka | set | keywords: +needs review; files: +re_ignore_case_range-3.5.patch; messages: +msg226989 |
| 2014-09-08 19:58:14 | serhiy.storchaka | set | files: +re_ignore_case_range.patch; versions: + Python 3.4, Python 3.5, - Python 3.3; messages: +msg226608; assignee: serhiy.storchaka; keywords: +patch; stage: patch review |
| 2013-03-12 08:11:31 | ezio.melotti | set | messages: +msg184016 |
| 2013-03-11 21:24:54 | serhiy.storchaka | set | nosy: +serhiy.storchaka; messages: +msg183993 |
| 2013-03-11 21:00:24 | mrabarnett | set | messages: +msg183992 |
| 2013-03-11 19:59:42 | acdha | set | messages: +msg183989 |
| 2013-03-11 19:50:21 | ezio.melotti | set | messages: +msg183988 |
| 2013-03-08 18:22:51 | acdha | set | messages: +msg183753 |
| 2013-03-07 23:19:56 | mrabarnett | set | messages: +msg183712 |
| 2013-03-07 20:52:32 | acdha | create | |
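
The compile-time cost raised in msg226989 can be measured roughly as follows. This is a minimal sketch: `time_compile` is a hypothetical helper, the narrow pattern is only an arbitrary point of comparison, and the numbers depend entirely on the interpreter version and machine, so no particular timings are claimed here. `re.purge()` is called first because, as the message notes, compiled patterns are cached and a cached pattern would hide the cost.

```python
import re
import time


def time_compile(pattern, flags=0):
    """Return (compiled pattern, seconds spent compiling).

    The re cache is purged first so the pattern is really compiled."""
    re.purge()
    start = time.perf_counter()
    compiled = re.compile(pattern, flags)
    return compiled, time.perf_counter() - start


# Worst case named in the thread: one range covering all code points.
wide, wide_seconds = time_compile(r"[\x00-\U0010ffff]+", re.I)

# Arbitrary narrow range, for comparison only.
narrow, narrow_seconds = time_compile(r"[a-z]+", re.I)

print(f"wide range:   {wide_seconds:.6f} s")
print(f"narrow range: {narrow_seconds:.6f} s")
```

Running the same script under different Python versions is one way to see whether the worst case still stands out after the fix.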