
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2019-07-30 22:02 by yannvgn, last changed 2022-04-11 14:59 by admin. This issue is now closed.
| Pull Requests | | |
|---|---|---|
| URL | Status | Linked |
| PR 15030 | merged | yannvgn, 2019-07-30 22:03 |
| PR 15059 | merged | miss-islington, 2019-07-31 18:50 |
| PR 15060 | merged | miss-islington, 2019-07-31 18:51 |
| Messages (10) | | |
|---|---|---|
| msg348771 | Author: yannvgn (yannvgn) | Date: 2019-07-30 22:02 |
On complex cases, parsing regular expressions takes much, much longer on Python >= 3.7.

Example (ipython):

```python
In [1]: import re

In [2]: char_list = ''.join([chr(i) for i in range(0xffff)])

In [3]: long_char_list = char_list * 10

In [4]: pattern = f'[{re.escape(long_char_list)}]'

In [5]: %time compiled = re.compile(pattern)
```

The test was run on Amazon Linux AMI 2017.03.

On Python 3.6.1, the regexp compiled in 2.6 seconds:

```
CPU times: user 2.59 s, sys: 30 ms, total: 2.62 s
Wall time: 2.64 s
```

On Python 3.7.3, the regexp compiled in 15 minutes (~350x increase in this case):

```
CPU times: user 15min 6s, sys: 240 ms, total: 15min 7s
Wall time: 15min 9s
```

Doing some profiling with cProfile shows that the issue is caused by the sre_parse._uniq function, which does not exist in Python <= 3.6. The complexity of this function is on average O(N^2) but can easily be reduced to O(N).

The issue might not be noticeable with simple regexps, but programs like text tokenizers, which use complex regexps, might really be impacted by this regression.
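For context, the helper at fault looks roughly like the sketch below. This is a reconstruction pieced together from this discussion, not a verbatim quote of CPython's Lib/sre_parse.py; the set fast path on the first line is the one msg348817 remarks on. The list-membership test in the slow path is what makes it quadratic.

```python
# Reconstructed sketch of the pre-fix _uniq (an assumption based on the
# discussion in this thread, not a verbatim copy of Lib/sre_parse.py).
def _uniq(items):
    # Fast path: the input already contains no duplicates.
    if len(set(items)) == len(items):
        return items
    # Slow path: "item not in newitems" is a linear scan of a list, so
    # de-duplicating N items costs O(N^2) comparisons overall.
    newitems = []
    for item in items:
        if item not in newitems:
            newitems.append(item)
    return newitems
```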
| msg348789 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2019-07-31 07:58 |
Indeed, it was not expected that the character set would contain hundreds of thousands of items. What is its size in your real code?

Could you please show benchmarking results for different implementations and different sizes?
| msg348813 | Author: yannvgn (yannvgn) | Date: 2019-07-31 16:28 |
> Indeed, it was not expected that the character set would contain hundreds of thousands of items. What is its size in your real code?
> Could you please show benchmarking results for different implementations and different sizes?

I can't precisely answer that, but sacremoses (a tokenization package), for example, is strongly impacted. See https://github.com/alvations/sacremoses/issues/61#issuecomment-516401853
| msg348814 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2019-07-31 16:32 |
Oh, this is convincing. | |||
| msg348817 | Author: Matthew Barnett (mrabarnett) | Date: 2019-07-31 17:23 |
I've just had a look at _uniq, and the code surprises me.

The obvious way to detect duplicates is with a set, but that requires the items to be hashable. Are they? Well, the first line of the function uses 'set', so they are.

Why, then, isn't it using a set to detect the duplicates? How about this:

```python
def _uniq(items):
    newitems = []
    seen = set()
    for item in items:
        if item not in seen:
            newitems.append(item)
            seen.add(item)
    return newitems
```
| msg348821 | Author: yannvgn (yannvgn) | Date: 2019-07-31 18:45 |
Hey Matthew,

we decided to go for this, which is simpler and more straightforward:

```python
def _uniq(items):
    return list(dict.fromkeys(items))
```

(see https://github.com/python/cpython/pull/15030)
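A side note for readers: dict preserves insertion order as a language guarantee since Python 3.7 (and in CPython 3.6 as an implementation detail), so dict.fromkeys de-duplicates in linear time while keeping the first occurrence of each item. A minimal, self-contained comparison of the two approaches (a sketch with made-up helper names, not code from the patch):

```python
from timeit import timeit

def uniq_quadratic(items):
    # List-membership version: worst case O(N^2).
    newitems = []
    for item in items:
        if item not in newitems:
            newitems.append(item)
    return newitems

def uniq_linear(items):
    # dict keeps insertion order (guaranteed since Python 3.7),
    # so this is O(N) and keeps the first occurrence of each item.
    return list(dict.fromkeys(items))

items = list(range(5_000)) * 2  # 10,000 items, half of them duplicates
assert uniq_quadratic(items) == uniq_linear(items)
print("quadratic:", timeit(lambda: uniq_quadratic(items), number=1))
print("linear:   ", timeit(lambda: uniq_linear(items), number=1))
```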
| msg348822 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2019-07-31 18:51 |
New changeset 9f55551f3df238e58315e724e50cb0d574d75b94 by Serhiy Storchaka (yannvgn) in branch 'master':
bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
https://github.com/python/cpython/commit/9f55551f3df238e58315e724e50cb0d574d75b94
| msg348823 | Author: miss-islington (miss-islington) | Date: 2019-07-31 20:22 |
New changeset 33b700ba8cbb128519442eeed8c8747ff73f4524 by Miss Islington (bot) in branch '3.7':
bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
https://github.com/python/cpython/commit/33b700ba8cbb128519442eeed8c8747ff73f4524
| msg348824 | Author: miss-islington (miss-islington) | Date: 2019-07-31 20:22 |
New changeset 77fcccb5321137456549b7f55b819f2c8a4c78a4 by Miss Islington (bot) in branch '3.8':
bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
https://github.com/python/cpython/commit/77fcccb5321137456549b7f55b819f2c8a4c78a4
| msg348975 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2019-08-04 07:41 |
Thank you for your contribution, yannvgn.
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:59:18 | admin | set | github: 81904 |
| 2019-08-04 07:41:20 | serhiy.storchaka | set | status: open -> closed; resolution: fixed; messages: +msg348975; stage: patch review -> resolved |
| 2019-07-31 20:22:38 | miss-islington | set | nosy: +miss-islington; messages: +msg348824 |
| 2019-07-31 20:22:38 | miss-islington | set | nosy: +miss-islington; messages: +msg348823 |
| 2019-07-31 18:51:06 | serhiy.storchaka | set | messages: +msg348822 |
| 2019-07-31 18:51:01 | miss-islington | set | pull_requests: +pull_request14809 |
| 2019-07-31 18:50:54 | miss-islington | set | pull_requests: +pull_request14808 |
| 2019-07-31 18:45:08 | yannvgn | set | messages: +msg348821 |
| 2019-07-31 17:23:32 | mrabarnett | set | messages: +msg348817 |
| 2019-07-31 16:32:51 | serhiy.storchaka | set | messages: +msg348814 |
| 2019-07-31 16:28:12 | yannvgn | set | messages: +msg348813 |
| 2019-07-31 07:58:00 | serhiy.storchaka | set | messages: +msg348789 |
| 2019-07-31 06:24:39 | serhiy.storchaka | set | nosy: +serhiy.storchaka |
| 2019-07-30 22:03:50 | yannvgn | set | keywords: +patch; stage: patch review; pull_requests: +pull_request14788 |
| 2019-07-30 22:02:34 | yannvgn | create | |