
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2012-06-07 13:57 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Files | ||
|---|---|---|
| File name | Uploaded | Description |
| encode_utf32_2.patch | serhiy.storchaka, 2012-10-20 19:05 | |
| encode_utf32_3.patch | serhiy.storchaka, 2013-12-11 22:17 | |
| Messages (21) | |||
|---|---|---|---|
| msg162474 | Author: Serhiy Storchaka (serhiy.storchaka)* | Date: 2012-06-07 13:57 | |
As a companion to issue14625, here is a patch that speeds up UTF-32 encoding several times over. In addition, it fixes an unsafe integer overflow check.
Here are the benchmark results. See the benchmark tools in the https://bitbucket.org/storchaka/cpython-stuff repository.
On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:
Py2.7         Py3.2         Py3.3        patched
541 (+1032%)  541 (+1032%)  844 (+626%)  6125     encode utf-32le 'A'*10000
543 (+1056%)  541 (+1060%)  844 (+643%)  6275     encode utf-32le '\x80'*10000
544 (+1010%)  542 (+1014%)  843 (+616%)  6037     encode utf-32le '\x80'+'A'*9999
541 (+799%)   542 (+797%)   764 (+537%)  4864     encode utf-32le '\u0100'*10000
544 (+781%)   542 (+784%)   767 (+525%)  4793     encode utf-32le '\u0100'+'A'*9999
544 (+789%)   542 (+792%)   766 (+531%)  4834     encode utf-32le '\u0100'+'\x80'*9999
542 (+799%)   541 (+801%)   764 (+538%)  4874     encode utf-32le '\u8000'*10000
544 (+779%)   542 (+782%)   767 (+523%)  4780     encode utf-32le '\u8000'+'A'*9999
544 (+793%)   542 (+796%)   766 (+534%)  4859     encode utf-32le '\u8000'+'\x80'*9999
544 (+819%)   542 (+823%)   766 (+553%)  5001     encode utf-32le '\u8000'+'\u0100'*9999
430 (+867%)   427 (+874%)   860 (+383%)  4157     encode utf-32le '\U00010000'*10000
543 (+655%)   543 (+655%)   861 (+376%)  4101     encode utf-32le '\U00010000'+'A'*9999
543 (+658%)   543 (+658%)   861 (+378%)  4116     encode utf-32le '\U00010000'+'\x80'*9999
543 (+670%)   543 (+670%)   859 (+387%)  4180     encode utf-32le '\U00010000'+'\u0100'*9999
543 (+666%)   543 (+666%)   860 (+383%)  4158     encode utf-32le '\U00010000'+'\u8000'*9999
541 (+880%)   543 (+876%)   844 (+528%)  5300     encode utf-32be 'A'*10000
541 (+872%)   542 (+870%)   844 (+523%)  5256     encode utf-32be '\x80'*10000
544 (+843%)   542 (+846%)   843 (+509%)  5130     encode utf-32be '\x80'+'A'*9999
541 (+363%)   542 (+362%)   764 (+228%)  2505     encode utf-32be '\u0100'*10000
544 (+366%)   542 (+368%)   766 (+231%)  2534     encode utf-32be '\u0100'+'A'*9999
544 (+363%)   542 (+365%)   766 (+229%)  2519     encode utf-32be '\u0100'+'\x80'*9999
542 (+363%)   541 (+364%)   764 (+228%)  2509     encode utf-32be '\u8000'*10000
544 (+366%)   542 (+368%)   766 (+231%)  2534     encode utf-32be '\u8000'+'A'*9999
544 (+363%)   542 (+364%)   766 (+229%)  2517     encode utf-32be '\u8000'+'\x80'*9999
544 (+372%)   542 (+374%)   766 (+235%)  2568     encode utf-32be '\u8000'+'\u0100'*9999
430 (+428%)   427 (+432%)   860 (+164%)  2270     encode utf-32be '\U00010000'*10000
543 (+317%)   541 (+318%)   861 (+163%)  2262     encode utf-32be '\U00010000'+'A'*9999
543 (+320%)   541 (+321%)   861 (+165%)  2279     encode utf-32be '\U00010000'+'\x80'*9999
543 (+322%)   541 (+323%)   859 (+167%)  2290     encode utf-32be '\U00010000'+'\u0100'*9999
543 (+322%)   541 (+324%)   860 (+167%)  2292     encode utf-32be '\U00010000'+'\u8000'*9999 | |||
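A rough way to reproduce this kind of measurement on a current interpreter is sketched below. It is not the benchmark tool from the linked repository: the helper name `encode_rate`, the sample selection, and the choice of characters per second as the unit are assumptions made for illustration, and the thread does not state the unit of its own figures.

```python
import timeit

# Test strings modelled on some of the benchmark rows: 10000 characters each,
# drawn from the ASCII, Latin-1, BMP and non-BMP ranges.
SAMPLES = {
    "'A'*10000": "A" * 10000,
    "'\\x80'*10000": "\x80" * 10000,
    "'\\u0100'*10000": "\u0100" * 10000,
    "'\\U00010000'*10000": "\U00010000" * 10000,
}


def encode_rate(text, codec, repeat=5, number=200):
    """Return the best observed encoding rate in characters per second.

    Hypothetical helper: takes the fastest of `repeat` timing runs,
    each of which encodes `text` `number` times.
    """
    best = min(timeit.repeat(lambda: text.encode(codec), repeat=repeat, number=number))
    return len(text) * number / best


if __name__ == "__main__":
    for codec in ("utf-32-le", "utf-32-be"):
        for label, text in SAMPLES.items():
            print(f"{encode_rate(text, codec):14.0f} chars/s  encode {codec} {label}")
```

Absolute numbers depend heavily on the machine and Python build, so only ratios between interpreters on the same hardware are comparable.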
| msg162823 | Author: Serhiy Storchaka (serhiy.storchaka)* | Date: 2012-06-14 20:30 | |
On 32-bit Linux, Intel Atom N570 @ 1.66GHz:
Py2.7         Py3.2         Py3.3        patched
214 (+718%)   215 (+714%)   363 (+382%)  1750     encode utf-32le 'A'*10000
214 (+704%)   214 (+704%)   362 (+375%)  1720     encode utf-32le '\x80'*10000
214 (+712%)   215 (+708%)   363 (+379%)  1738     encode utf-32le '\x80'+'A'*9999
214 (+698%)   214 (+698%)   342 (+399%)  1707     encode utf-32le '\u0100'*10000
214 (+688%)   215 (+684%)   343 (+392%)  1686     encode utf-32le '\u0100'+'A'*9999
214 (+699%)   215 (+695%)   342 (+400%)  1710     encode utf-32le '\u0100'+'\x80'*9999
214 (+694%)   214 (+694%)   342 (+397%)  1699     encode utf-32le '\u8000'*10000
214 (+688%)   215 (+685%)   343 (+392%)  1687     encode utf-32le '\u8000'+'A'*9999
214 (+700%)   214 (+700%)   342 (+401%)  1713     encode utf-32le '\u8000'+'\x80'*9999
214 (+682%)   215 (+679%)   342 (+389%)  1674     encode utf-32le '\u8000'+'\u0100'*9999
121 (+2237%)  121 (+2237%)  333 (+749%)  2828     encode utf-32le '\U00010000'*10000
214 (+1108%)  214 (+1108%)  333 (+676%)  2585     encode utf-32le '\U00010000'+'A'*9999
214 (+1112%)  214 (+1112%)  333 (+679%)  2594     encode utf-32le '\U00010000'+'\x80'*9999
214 (+1208%)  214 (+1208%)  333 (+741%)  2799     encode utf-32le '\U00010000'+'\u0100'*9999
214 (+1214%)  215 (+1208%)  333 (+745%)  2813     encode utf-32le '\U00010000'+'\u8000'*9999
214 (+556%)   214 (+556%)   363 (+287%)  1404     encode utf-32be 'A'*10000
214 (+558%)   214 (+558%)   363 (+288%)  1408     encode utf-32be '\x80'*10000
214 (+550%)   214 (+550%)   363 (+283%)  1390     encode utf-32be '\x80'+'A'*9999
214 (+224%)   214 (+224%)   342 (+103%)  693      encode utf-32be '\u0100'*10000
214 (+229%)   214 (+229%)   343 (+105%)  703      encode utf-32be '\u0100'+'A'*9999
214 (+221%)   214 (+221%)   342 (+101%)  688      encode utf-32be '\u0100'+'\x80'*9999
214 (+224%)   214 (+224%)   342 (+103%)  694      encode utf-32be '\u8000'*10000
215 (+227%)   214 (+229%)   343 (+105%)  704      encode utf-32be '\u8000'+'A'*9999
214 (+221%)   214 (+221%)   342 (+101%)  686      encode utf-32be '\u8000'+'\x80'*9999
214 (+222%)   214 (+222%)   341 (+102%)  690      encode utf-32be '\u8000'+'\u0100'*9999
121 (+387%)   121 (+387%)   333 (+77%)   589      encode utf-32be '\U00010000'*10000
214 (+174%)   215 (+173%)   333 (+76%)   587      encode utf-32be '\U00010000'+'A'*9999
214 (+183%)   214 (+183%)   333 (+82%)   606      encode utf-32be '\U00010000'+'\x80'*9999
214 (+184%)   214 (+184%)   333 (+82%)   607      encode utf-32be '\U00010000'+'\u0100'*9999
214 (+183%)   214 (+183%)   333 (+82%)   605      encode utf-32be '\U00010000'+'\u8000'*9999 | |||
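The thread does not spell out what the parenthesized percentages mean, but the arithmetic is consistent with "how much faster the patched build is than this column". The sketch below checks that reading against two rows of the Intel Atom results; `speedup_percent` is a hypothetical helper name.

```python
def speedup_percent(baseline, patched):
    """Percentage by which `patched` exceeds `baseline`, rounded to an integer."""
    return round((patched / baseline - 1) * 100)


# First utf-32le row on the Atom: 214, 215 and 363 against 1750 (patched).
print(speedup_percent(214, 1750))  # 718
print(speedup_percent(215, 1750))  # 714
print(speedup_percent(363, 1750))  # 382

# The non-BMP utf-32le row: 121 against 2828 (patched).
print(speedup_percent(121, 2828))  # 2237
```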
| msg173404 | Author: Serhiy Storchaka (serhiy.storchaka)* | Date: 2012-10-20 19:05 | |
Patch updated to 3.4.
Is anyone interested in a 7x speedup of the UTF-32 encoder? | |||
| msg205912 | Author: Mark Lawrence (BreamoreBoy)* | Date: 2013-12-11 18:28 | |
From http://kmike.ru/python-data-structures/ under heading DATrie: "Python wrapper uses utf_32_le codec internally; this codec is currently slow and it is the bottleneck for datrie. There is a ticket with a patch in the CPython bug tracker (http://bugs.python.org/issue15027) that should make this codec fast, so there is a hope datrie will become faster with future Pythons." | |||
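For context on why a wrapper might lean on this codec: `utf_32_le` turns a string into a flat array of little-endian 32-bit code points, which is convenient to hand to C code. The sketch below only illustrates that property of the codec; it makes no claim about how datrie actually uses it.

```python
import struct

text = "A\u0100\U00010000"
data = text.encode("utf_32_le")

# UTF-32 is fixed-width: every code point takes exactly four bytes,
# and the explicit-endian codecs do not prepend a byte order mark.
assert len(data) == 4 * len(text)

# Unpack as little-endian unsigned 32-bit integers to recover the code points.
code_points = struct.unpack(f"<{len(text)}I", data)
print([hex(cp) for cp in code_points])  # ['0x41', '0x100', '0x10000']
```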
| msg205934 | Author: Serhiy Storchaka (serhiy.storchaka)* | Date: 2013-12-11 22:17 | |
Here is an updated patch, synchronized with trunk. The UTF-32 encoder now checks for surrogates, so the speedup is smaller (only up to 5 times). But this compensates for the regression in 3.4.
On 32-bit Linux, Intel Atom N570 @ 1.66GHz:
Py3.3        Py3.4        patched
531 (+245%)  489 (+274%)  1831     encode utf-32le 'A'*10000
383 (+158%)  223 (+344%)  990      encode utf-32le '\u0100'*10000
325 (+262%)  229 (+414%)  1177     encode utf-32le '\U00010000'*10000
544 (+166%)  494 (+193%)  1448     encode utf-32be 'A'*10000
384 (+67%)   223 (+188%)  642      encode utf-32be '\u0100'*10000
323 (+108%)  229 (+193%)  671      encode utf-32be '\U00010000'*10000 | |||
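The surrogate check mentioned here is visible from Python code. The following sketch, assuming Python 3.4 or later, shows the strict encoder rejecting a lone surrogate and the standard `surrogatepass` error handler letting it through.

```python
lone_surrogate = "\ud800"

# In Python 3.4 and later, the strict UTF-32 encoder rejects lone surrogates.
try:
    lone_surrogate.encode("utf-32-le")
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)

# The 'surrogatepass' error handler lets the code point through unchanged.
print(lone_surrogate.encode("utf-32-le", "surrogatepass"))  # b'\x00\xd8\x00\x00'
```

This extra per-character test is the kind of work that a fast path has to preserve, which fits the smaller speedup reported in this message.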
| msg205940 | Author: Gregory P. Smith (gregory.p.smith)* | Date: 2013-12-11 23:05 | |
There is one comment to address on the review; once that is addressed, I believe this is ready to go in for 3.4. | |||
| msg207292 | Author: Roundup Robot (python-dev) | Date: 2014-01-04 17:26 | |
New changeset b72c5573c5e7 by Serhiy Storchaka in branch 'default':
Issue #15027: Rewrite the UTF-32 encoder. It is now 1.6x to 3.5x faster.
http://hg.python.org/cpython/rev/b72c5573c5e7 | |||
| msg207294 | Author: Serhiy Storchaka (serhiy.storchaka)* | Date: 2014-01-04 17:32 | |
Thank you Gregory for your review. | |||
| msg207302 | Author: Larry Hastings (larry)* | Date: 2014-01-04 18:41 | |
Isn't this a new feature? | |||
| msg207305 | Author: Serhiy Storchaka (serhiy.storchaka)* | Date: 2014-01-04 19:59 | |
Sorry if I missed something. Should I revert changeset b72c5573c5e7?
This patch doesn't introduce new functions and doesn't change behavior. Without this patch the UTF-32 encoder is up to 2.5x slower in 3.4 than in 3.3 (due to issue12892). | |||
| msg207306 | Author: Larry Hastings (larry)* | Date: 2014-01-04 20:10 | |
Would you describe it as a "bug fix" or a "security fix"? If it's neither of those things, then you need special permission to add it during beta. And given that this patch has the possibility of causing bugs, I'd prefer to not accept it for 3.4.
Please revert it for now. If you think it should go in to 3.4, you may ask on python-dev that it be considered and take a poll. (Note that the poll is not binding on me; this is still solely my decision. However if there was an uproar of support for your patch, that would certainly cause me to reconsider.) | |||
| msg207311 | Author: Roundup Robot (python-dev) | Date: 2014-01-04 20:51 | |
New changeset 1e345924f7ea by Serhiy Storchaka in branch 'default':
Reverted changeset b72c5573c5e7 (issue #15027).
http://hg.python.org/cpython/rev/1e345924f7ea | |||
| msg210147 | Author: Larry Hastings (larry)* | Date: 2014-02-03 16:02 | |
BreamoreBoy: why did you remove Arfrever from this issue? | |||
| msg210148 | Author: Charles-François Natali (neologix)* | Date: 2014-02-03 16:28 | |
> BreamoreBoy: why did you remove Arfrever from this issue?
Nosy list members are sorted in alphabetical order: since Arfrever comes just before BreamoreBoy, I assume his fingers tripped ;-) | |||
| msg242871 | Author: Mark Lawrence (BreamoreBoy)* | Date: 2015-05-10 23:21 | |
As this appears to be a performance improvement only, can it go into 3.5, or do we wait for 3.x? | |||
| msg242954 | Author: Serhiy Storchaka (serhiy.storchaka)* | Date: 2015-05-12 10:22 | |
Can I commit the patch now, Larry? | |||
| msg242981 | Author: Larry Hastings (larry)* | Date: 2015-05-12 15:43 | |
We're still in alpha, so it's fine for 3.5 right now. The cutoff for new features for 3.5 will be May 23. | |||
| msg243005 | Author: Roundup Robot (python-dev) | Date: 2015-05-12 20:13 | |
New changeset 80cf7723c4cf by Serhiy Storchaka in branch 'default':
Issue #15027: The UTF-32 encoder is now 3x to 7x faster.
https://hg.python.org/cpython/rev/80cf7723c4cf | |||
| msg243008 | Author: Serhiy Storchaka (serhiy.storchaka)* | Date: 2015-05-12 20:26 | |
And that's not all... | |||
| msg243523 | Author: Arfrever Frehtes Taifersar Arahesis (Arfrever)* | Date: 2015-05-18 19:14 | |
In Objects/stringlib/codecs.h, two comments mention U+DC800; this should be changed to U+D800 (from the definition of Py_UNICODE_IS_SURROGATE) or U+DC80 (from the result of b"\x80".decode(errors="surrogateescape")). | |||
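Both candidate corrections can be checked from Python. In the sketch below, `is_surrogate` is a hypothetical Python analogue of the surrogate range test (U+D800 through U+DFFF), not the C macro itself.

```python
def is_surrogate(ch):
    """Python analogue of a surrogate range test: U+D800 through U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF


# An undecodable byte 0x80 is mapped to the lone low surrogate U+DC80.
escaped = b"\x80".decode(errors="surrogateescape")
print(hex(ord(escaped)))        # 0xdc80
print(is_surrogate(escaped))    # True
print(is_surrogate("\ud800"))   # True

# U+DC800, the value in the typo, is an ordinary code point outside the range.
print(is_surrogate(chr(0xDC800)))  # False
```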
| msg243524 | Author: Serhiy Storchaka (serhiy.storchaka)* | Date: 2015-05-18 19:22 | |
Thank you Arfrever. That was an old copy-pasted typo. Fixed in 3d5bf6174c4b and bc6ed8360312. | |||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:31 | admin | set | github: 59232 |
| 2015-05-18 19:22:44 | serhiy.storchaka | set | messages: +msg243524 |
| 2015-05-18 19:14:25 | Arfrever | set | messages: +msg243523 |
| 2015-05-12 20:26:46 | serhiy.storchaka | set | status: open -> closed resolution: fixed messages: +msg243008 stage: patch review -> resolved |
| 2015-05-12 20:13:08 | python-dev | set | messages: +msg243005 |
| 2015-05-12 15:43:59 | larry | set | messages: +msg242981 |
| 2015-05-12 10:22:08 | serhiy.storchaka | set | messages: +msg242954 |
| 2015-05-10 23:21:30 | BreamoreBoy | set | nosy: +BreamoreBoy messages: +msg242871 |
| 2014-02-03 17:00:17 | BreamoreBoy | set | nosy: -BreamoreBoy |
| 2014-02-03 16:28:02 | neologix | set | nosy: +Arfrever,neologix messages: +msg210148 |
| 2014-02-03 16:02:10 | larry | set | messages: +msg210147 |
| 2014-02-03 15:38:19 | BreamoreBoy | set | nosy: -Arfrever |
| 2014-01-04 20:55:35 | serhiy.storchaka | set | status: closed -> open stage: resolved -> patch review resolution: fixed -> (no value) versions: + Python 3.5, - Python 3.4 |
| 2014-01-04 20:51:11 | python-dev | set | messages: +msg207311 |
| 2014-01-04 20:10:40 | larry | set | messages: +msg207306 |
| 2014-01-04 19:59:30 | serhiy.storchaka | set | messages: +msg207305 |
| 2014-01-04 18:41:17 | larry | set | nosy: +larry messages: +msg207302 |
| 2014-01-04 17:32:35 | serhiy.storchaka | set | status: open -> closed resolution: fixed messages: +msg207294 stage: patch review -> resolved |
| 2014-01-04 17:26:00 | python-dev | set | nosy: +python-dev messages: +msg207292 |
| 2013-12-11 23:05:08 | gregory.p.smith | set | priority: low -> normal nosy: +gregory.p.smith messages: +msg205940 |
| 2013-12-11 22:17:02 | serhiy.storchaka | set | files: +encode_utf32_3.patch messages: +msg205934 |
| 2013-12-11 18:28:28 | BreamoreBoy | set | nosy: +BreamoreBoy messages: +msg205912 |
| 2013-01-07 17:51:10 | serhiy.storchaka | set | priority: normal -> low assignee: serhiy.storchaka |
| 2012-10-24 09:02:58 | serhiy.storchaka | set | stage: patch review |
| 2012-10-20 19:05:07 | serhiy.storchaka | set | keywords: +needs review files: +encode_utf32_2.patch messages: +msg173404 versions: + Python 3.4, - Python 3.3 |
| 2012-10-20 19:03:19 | serhiy.storchaka | set | files: -encode-utf32.patch |
| 2012-07-17 20:44:37 | kmike | set | nosy: +kmike |
| 2012-06-14 20:30:11 | serhiy.storchaka | set | messages: +msg162823 |
| 2012-06-07 13:57:30 | serhiy.storchaka | create | |