Issue15027

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Faster UTF-32 encoding
Type: performance
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.5

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: Arfrever, BreamoreBoy, asvetlov, ezio.melotti, gregory.p.smith, kmike, larry, neologix, pitrou, python-dev, serhiy.storchaka, vstinner
Priority: normal
Keywords: needs review, patch

Created on 2012-06-07 13:57 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name             Uploaded
encode_utf32_2.patch  serhiy.storchaka, 2012-10-20 19:05
encode_utf32_3.patch  serhiy.storchaka, 2013-12-11 22:17
Messages (21)
msg162474 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) - Date: 2012-06-07 13:57
As a companion to issue14625, here is a patch that speeds up UTF-32 encoding several times over. In addition, it fixes an unsafe integer overflow check.

Here are the results of benchmarking. See the benchmark tools in the https://bitbucket.org/storchaka/cpython-stuff repository.

On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz (higher is better; each percentage is the gain of the patched build over that column):

Py2.7         Py3.2         Py3.3         patched

541 (+1032%)  541 (+1032%)  844 (+626%)   6125   encode  utf-32le  'A'*10000
543 (+1056%)  541 (+1060%)  844 (+643%)   6275   encode  utf-32le  '\x80'*10000
544 (+1010%)  542 (+1014%)  843 (+616%)   6037   encode  utf-32le  '\x80'+'A'*9999
541 (+799%)   542 (+797%)   764 (+537%)   4864   encode  utf-32le  '\u0100'*10000
544 (+781%)   542 (+784%)   767 (+525%)   4793   encode  utf-32le  '\u0100'+'A'*9999
544 (+789%)   542 (+792%)   766 (+531%)   4834   encode  utf-32le  '\u0100'+'\x80'*9999
542 (+799%)   541 (+801%)   764 (+538%)   4874   encode  utf-32le  '\u8000'*10000
544 (+779%)   542 (+782%)   767 (+523%)   4780   encode  utf-32le  '\u8000'+'A'*9999
544 (+793%)   542 (+796%)   766 (+534%)   4859   encode  utf-32le  '\u8000'+'\x80'*9999
544 (+819%)   542 (+823%)   766 (+553%)   5001   encode  utf-32le  '\u8000'+'\u0100'*9999
430 (+867%)   427 (+874%)   860 (+383%)   4157   encode  utf-32le  '\U00010000'*10000
543 (+655%)   543 (+655%)   861 (+376%)   4101   encode  utf-32le  '\U00010000'+'A'*9999
543 (+658%)   543 (+658%)   861 (+378%)   4116   encode  utf-32le  '\U00010000'+'\x80'*9999
543 (+670%)   543 (+670%)   859 (+387%)   4180   encode  utf-32le  '\U00010000'+'\u0100'*9999
543 (+666%)   543 (+666%)   860 (+383%)   4158   encode  utf-32le  '\U00010000'+'\u8000'*9999

541 (+880%)   543 (+876%)   844 (+528%)   5300   encode  utf-32be  'A'*10000
541 (+872%)   542 (+870%)   844 (+523%)   5256   encode  utf-32be  '\x80'*10000
544 (+843%)   542 (+846%)   843 (+509%)   5130   encode  utf-32be  '\x80'+'A'*9999
541 (+363%)   542 (+362%)   764 (+228%)   2505   encode  utf-32be  '\u0100'*10000
544 (+366%)   542 (+368%)   766 (+231%)   2534   encode  utf-32be  '\u0100'+'A'*9999
544 (+363%)   542 (+365%)   766 (+229%)   2519   encode  utf-32be  '\u0100'+'\x80'*9999
542 (+363%)   541 (+364%)   764 (+228%)   2509   encode  utf-32be  '\u8000'*10000
544 (+366%)   542 (+368%)   766 (+231%)   2534   encode  utf-32be  '\u8000'+'A'*9999
544 (+363%)   542 (+364%)   766 (+229%)   2517   encode  utf-32be  '\u8000'+'\x80'*9999
544 (+372%)   542 (+374%)   766 (+235%)   2568   encode  utf-32be  '\u8000'+'\u0100'*9999
430 (+428%)   427 (+432%)   860 (+164%)   2270   encode  utf-32be  '\U00010000'*10000
543 (+317%)   541 (+318%)   861 (+163%)   2262   encode  utf-32be  '\U00010000'+'A'*9999
543 (+320%)   541 (+321%)   861 (+165%)   2279   encode  utf-32be  '\U00010000'+'\x80'*9999
543 (+322%)   541 (+323%)   859 (+167%)   2290   encode  utf-32be  '\U00010000'+'\u0100'*9999
543 (+322%)   541 (+324%)   860 (+167%)   2292   encode  utf-32be  '\U00010000'+'\u8000'*9999
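For readers who want to repeat a measurement of this kind, the following is a minimal sketch using only the standard `timeit` module. It is not the benchmark tool from the repository mentioned above. The helper name `encode_rate`, the choice of MB/s of encoded output as the unit, and the `repeat`/`number` defaults are all assumptions made for illustration.

```python
import timeit

# A few inputs modelled on the rows of the table above: one sample each
# for ASCII, Latin-1, BMP and non-BMP text.
SAMPLES = {
    "'A'*10000": "A" * 10000,
    "'\\x80'*10000": "\x80" * 10000,
    "'\\u0100'*10000": "\u0100" * 10000,
    "'\\U00010000'*10000": "\U00010000" * 10000,
}


def encode_rate(text, encoding, repeat=3, number=100):
    """Best observed encode rate in MB/s of output (hypothetical helper)."""
    out_size = len(text.encode(encoding))
    # timeit.repeat returns total seconds for `number` calls, per repetition;
    # the minimum is the least noisy estimate.
    best = min(timeit.repeat(lambda: text.encode(encoding),
                             repeat=repeat, number=number))
    return out_size * number / best / 1e6


for encoding in ("utf-32le", "utf-32be"):
    for label, text in SAMPLES.items():
        rate = encode_rate(text, encoding)
        print(f"{rate:8.0f}   encode  {encoding}  {label}")
```

Absolute figures depend on the machine and the interpreter build, so only ratios between builds measured on the same machine are comparable.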
msg162823 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) - Date: 2012-06-14 20:30
On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

Py2.7         Py3.2         Py3.3         patched

214 (+718%)   215 (+714%)   363 (+382%)   1750   encode  utf-32le  'A'*10000
214 (+704%)   214 (+704%)   362 (+375%)   1720   encode  utf-32le  '\x80'*10000
214 (+712%)   215 (+708%)   363 (+379%)   1738   encode  utf-32le  '\x80'+'A'*9999
214 (+698%)   214 (+698%)   342 (+399%)   1707   encode  utf-32le  '\u0100'*10000
214 (+688%)   215 (+684%)   343 (+392%)   1686   encode  utf-32le  '\u0100'+'A'*9999
214 (+699%)   215 (+695%)   342 (+400%)   1710   encode  utf-32le  '\u0100'+'\x80'*9999
214 (+694%)   214 (+694%)   342 (+397%)   1699   encode  utf-32le  '\u8000'*10000
214 (+688%)   215 (+685%)   343 (+392%)   1687   encode  utf-32le  '\u8000'+'A'*9999
214 (+700%)   214 (+700%)   342 (+401%)   1713   encode  utf-32le  '\u8000'+'\x80'*9999
214 (+682%)   215 (+679%)   342 (+389%)   1674   encode  utf-32le  '\u8000'+'\u0100'*9999
121 (+2237%)  121 (+2237%)  333 (+749%)   2828   encode  utf-32le  '\U00010000'*10000
214 (+1108%)  214 (+1108%)  333 (+676%)   2585   encode  utf-32le  '\U00010000'+'A'*9999
214 (+1112%)  214 (+1112%)  333 (+679%)   2594   encode  utf-32le  '\U00010000'+'\x80'*9999
214 (+1208%)  214 (+1208%)  333 (+741%)   2799   encode  utf-32le  '\U00010000'+'\u0100'*9999
214 (+1214%)  215 (+1208%)  333 (+745%)   2813   encode  utf-32le  '\U00010000'+'\u8000'*9999

214 (+556%)   214 (+556%)   363 (+287%)   1404   encode  utf-32be  'A'*10000
214 (+558%)   214 (+558%)   363 (+288%)   1408   encode  utf-32be  '\x80'*10000
214 (+550%)   214 (+550%)   363 (+283%)   1390   encode  utf-32be  '\x80'+'A'*9999
214 (+224%)   214 (+224%)   342 (+103%)   693    encode  utf-32be  '\u0100'*10000
214 (+229%)   214 (+229%)   343 (+105%)   703    encode  utf-32be  '\u0100'+'A'*9999
214 (+221%)   214 (+221%)   342 (+101%)   688    encode  utf-32be  '\u0100'+'\x80'*9999
214 (+224%)   214 (+224%)   342 (+103%)   694    encode  utf-32be  '\u8000'*10000
215 (+227%)   214 (+229%)   343 (+105%)   704    encode  utf-32be  '\u8000'+'A'*9999
214 (+221%)   214 (+221%)   342 (+101%)   686    encode  utf-32be  '\u8000'+'\x80'*9999
214 (+222%)   214 (+222%)   341 (+102%)   690    encode  utf-32be  '\u8000'+'\u0100'*9999
121 (+387%)   121 (+387%)   333 (+77%)    589    encode  utf-32be  '\U00010000'*10000
214 (+174%)   215 (+173%)   333 (+76%)    587    encode  utf-32be  '\U00010000'+'A'*9999
214 (+183%)   214 (+183%)   333 (+82%)    606    encode  utf-32be  '\U00010000'+'\x80'*9999
214 (+184%)   214 (+184%)   333 (+82%)    607    encode  utf-32be  '\U00010000'+'\u0100'*9999
214 (+183%)   214 (+183%)   333 (+82%)    605    encode  utf-32be  '\U00010000'+'\u8000'*9999
msg173404 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) - Date: 2012-10-20 19:05
Patch updated to 3.4.

Is anyone interested in a 7x speedup of the UTF-32 encoder?
msg205912 - Author: Mark Lawrence (BreamoreBoy) - Date: 2013-12-11 18:28
From http://kmike.ru/python-data-structures/, under the heading DATrie: "Python wrapper uses utf_32_le codec internally; this codec is currently slow and it is the bottleneck for datrie. There is a ticket with a patch in the CPython bug tracker (http://bugs.python.org/issue15027) that should make this codec fast, so there is a hope datrie will become faster with future Pythons."
msg205934 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) - Date: 2013-12-11 22:17
Here is an updated patch, synchronized with trunk. The UTF-32 encoder now checks for surrogates, so the speedup is smaller (only up to 5 times). But this compensates for the regression in 3.4.

On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

Py3.3        Py3.4        patched

531 (+245%)  489 (+274%)  1831   encode  utf-32le  'A'*10000
383 (+158%)  223 (+344%)  990    encode  utf-32le  '\u0100'*10000
325 (+262%)  229 (+414%)  1177   encode  utf-32le  '\U00010000'*10000
544 (+166%)  494 (+193%)  1448   encode  utf-32be  'A'*10000
384 (+67%)   223 (+188%)  642    encode  utf-32be  '\u0100'*10000
323 (+108%)  229 (+193%)  671    encode  utf-32be  '\U00010000'*10000
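The surrogate check mentioned above is visible from Python code. The sketch below assumes Python 3.4 or later, where the UTF-32 codecs reject lone surrogates by default and accept them only with the `surrogatepass` error handler. The helper name `try_encode` is hypothetical.

```python
def try_encode(text, encoding, errors="strict"):
    """Return the encoded bytes, or the error reason if encoding fails
    (hypothetical helper for illustration)."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError as exc:
        return exc.reason


lone_surrogate = "\ud800"

# Strict mode: on Python 3.4+ the encoder refuses the surrogate, so the
# helper returns the error reason string instead of bytes.
print(try_encode(lone_surrogate, "utf-32le"))

# surrogatepass: the code point is written out as an ordinary 4-byte unit.
print(try_encode(lone_surrogate, "utf-32le", "surrogatepass"))

# Non-surrogate text, including non-BMP characters, encodes in any mode.
print(try_encode("\U00010000", "utf-32le"))
```

This per-character check is extra work that the first version of the patch did not have to do, which is consistent with the smaller speedup reported in this message.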
msg205940 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) - Date: 2013-12-11 23:05
One comment to address on the review; after addressing that, I believe this is ready to go in for 3.4.
msg207292 - Author: Roundup Robot (python-dev) (Python triager) - Date: 2014-01-04 17:26
New changeset b72c5573c5e7 by Serhiy Storchaka in branch 'default':
Issue #15027: Rewrite the UTF-32 encoder.  It is now 1.6x to 3.5x faster.
http://hg.python.org/cpython/rev/b72c5573c5e7
msg207294 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) - Date: 2014-01-04 17:32
Thank you, Gregory, for your review.
msg207302 - Author: Larry Hastings (larry) (Python committer) - Date: 2014-01-04 18:41
Isn't this a new feature?
msg207305 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) - Date: 2014-01-04 19:59
Sorry if I missed something. Should I revert changeset b72c5573c5e7?

This patch doesn't introduce new functions and doesn't change behavior. Without this patch the UTF-32 encoder is up to 2.5x slower in 3.4 than in 3.3 (due to issue12892).
msg207306 - Author: Larry Hastings (larry) (Python committer) - Date: 2014-01-04 20:10
Would you describe it as a "bug fix" or a "security fix"?  If it's neither of those things, then you need special permission to add it during beta.  And given that this patch has the possibility of causing bugs, I'd prefer to not accept it for 3.4.

Please revert it for now.  If you think it should go in to 3.4, you may ask on python-dev that it be considered and take a poll.  (Note that the poll is not binding on me; this is still solely my decision.  However if there was an uproar of support for your patch, that would certainly cause me to reconsider.)
msg207311 - Author: Roundup Robot (python-dev) (Python triager) - Date: 2014-01-04 20:51
New changeset 1e345924f7ea by Serhiy Storchaka in branch 'default':
Reverted changeset b72c5573c5e7 (issue #15027).
http://hg.python.org/cpython/rev/1e345924f7ea
msg210147 - Author: Larry Hastings (larry) (Python committer) - Date: 2014-02-03 16:02
BreamoreBoy: why did you remove Arfrever from this issue?
msg210148 - Author: Charles-François Natali (neologix) (Python committer) - Date: 2014-02-03 16:28
> BreamoreBoy: why did you remove Arfrever from this issue?

Nosy list members are sorted in alphabetical order: since Arfrever comes just before BreamoreBoy, I assume his fingers tripped ;-)
msg242871 - Author: Mark Lawrence (BreamoreBoy) - Date: 2015-05-10 23:21
As this appears to be a performance improvement only, can it go into 3.5 or do we wait for 3.x?
msg242954 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) - Date: 2015-05-12 10:22
Can I commit the patch now, Larry?
msg242981 - Author: Larry Hastings (larry) (Python committer) - Date: 2015-05-12 15:43
We're still in alpha, so it's fine for 3.5 right now.  The cutoff for new features for 3.5 will be May 23.
msg243005 - Author: Roundup Robot (python-dev) (Python triager) - Date: 2015-05-12 20:13
New changeset 80cf7723c4cf by Serhiy Storchaka in branch 'default':
Issue #15027: The UTF-32 encoder is now 3x to 7x faster.
https://hg.python.org/cpython/rev/80cf7723c4cf
msg243008 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) - Date: 2015-05-12 20:26
And that's not all...
msg243523 - Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) (Python triager) - Date: 2015-05-18 19:14
In Objects/stringlib/codecs.h, U+DC800 in 2 comments should be changed to U+D800 (from the definition of Py_UNICODE_IS_SURROGATE) or U+DC80 (from the result of b"\x80".decode(errors="surrogateescape")).
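The two candidate values can be checked from Python. This is a small sketch: `is_surrogate` is a hypothetical helper that mirrors the U+D800..U+DFFF range tested by the C macro, not the macro itself.

```python
def is_surrogate(code_point):
    """True for code points in the surrogate range U+D800..U+DFFF
    (hypothetical helper mirroring the range the C macro tests)."""
    return 0xD800 <= code_point <= 0xDFFF


# surrogateescape maps an undecodable byte 0xNN to the lone surrogate U+DCNN.
escaped = b"\x80".decode("utf-8", errors="surrogateescape")
print(hex(ord(escaped)))

print(is_surrogate(0xD800))    # lower bound of the surrogate range
print(is_surrogate(0xDC80))    # the escaped form of byte 0x80
print(is_surrogate(0xDC800))   # the value from the typo, outside the range
```

The value in the typo has one hex digit too many, which puts it well above the surrogate range.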
msg243524 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) - Date: 2015-05-18 19:22
Thank you, Arfrever. That was an old copy-pasted typo. Fixed in 3d5bf6174c4b and bc6ed8360312.
History
Date                 User              Action  Args
2022-04-11 14:57:31  admin             set     github: 59232
2015-05-18 19:22:44  serhiy.storchaka  set     messages: +msg243524
2015-05-18 19:14:25  Arfrever          set     messages: +msg243523
2015-05-12 20:26:46  serhiy.storchaka  set     status: open -> closed; resolution: fixed; messages: +msg243008; stage: patch review -> resolved
2015-05-12 20:13:08  python-dev        set     messages: +msg243005
2015-05-12 15:43:59  larry             set     messages: +msg242981
2015-05-12 10:22:08  serhiy.storchaka  set     messages: +msg242954
2015-05-10 23:21:30  BreamoreBoy       set     nosy: +BreamoreBoy; messages: +msg242871
2014-02-03 17:00:17  BreamoreBoy       set     nosy: -BreamoreBoy
2014-02-03 16:28:02  neologix          set     nosy: +Arfrever, neologix; messages: +msg210148
2014-02-03 16:02:10  larry             set     messages: +msg210147
2014-02-03 15:38:19  BreamoreBoy       set     nosy: -Arfrever
2014-01-04 20:55:35  serhiy.storchaka  set     status: closed -> open; stage: resolved -> patch review; resolution: fixed -> (no value); versions: + Python 3.5, - Python 3.4
2014-01-04 20:51:11  python-dev        set     messages: +msg207311
2014-01-04 20:10:40  larry             set     messages: +msg207306
2014-01-04 19:59:30  serhiy.storchaka  set     messages: +msg207305
2014-01-04 18:41:17  larry             set     nosy: +larry; messages: +msg207302
2014-01-04 17:32:35  serhiy.storchaka  set     status: open -> closed; resolution: fixed; messages: +msg207294; stage: patch review -> resolved
2014-01-04 17:26:00  python-dev        set     nosy: +python-dev; messages: +msg207292
2013-12-11 23:05:08  gregory.p.smith   set     priority: low -> normal; nosy: +gregory.p.smith; messages: +msg205940
2013-12-11 22:17:02  serhiy.storchaka  set     files: +encode_utf32_3.patch; messages: +msg205934
2013-12-11 18:28:28  BreamoreBoy       set     nosy: +BreamoreBoy; messages: +msg205912
2013-01-07 17:51:10  serhiy.storchaka  set     priority: normal -> low; assignee: serhiy.storchaka
2012-10-24 09:02:58  serhiy.storchaka  set     stage: patch review
2012-10-20 19:05:07  serhiy.storchaka  set     keywords: +needs review; files: +encode_utf32_2.patch; messages: +msg173404; versions: + Python 3.4, - Python 3.3
2012-10-20 19:03:19  serhiy.storchaka  set     files: -encode-utf32.patch
2012-07-17 20:44:37  kmike             set     nosy: +kmike
2012-06-14 20:30:11  serhiy.storchaka  set     messages: +msg162823
2012-06-07 13:57:30  serhiy.storchaka  create