[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

"Martin v. Löwis" martin at v.loewis.de
Wed Aug 24 12:16:25 CEST 2005


Walter Dörwald wrote:
> This is caused by the changes to the codecs in 2.4. Basically the codecs
> no longer rely on C's readline() to do line splitting (which can't work
> for UTF-16), but do it themselves (via unicode.splitlines()).

That explains why you get any calls to IsLineBreak; it doesn't explain
why you get so many of them.

I investigated this a bit, and one issue seems to be that
StreamReader.readline performs splitlines on the entire input, only to
fetch the first line. It then joins the rest for later processing.
In addition, it also performs splitlines on a single line, just to
strip any trailing line breaks.

The net effect is that, for a file with N lines, IsLineBreak is invoked
up to N*N/2 times per character (at least for the last character).

So I think it would be best if Unicode characters exposed a .islinebreak
method (or, failing that, codecs just knew what the line break
characters are in Unicode 3.2), and then codecs would split off
the first line of input itself.

>>After doing some gprof profiling, I discovered _PyUnicodeUCS2_IsLinebreak was
>>getting called 51 million times. Our code is 1.2 million characters, so I
>>hardly think it makes sense to call IsLinebreak 50 times for each character;
>>and we're not even importing our entire source tree on every invocation.
>
> But if you're using CGI, you're importing your source on every
> invocation.

Well, no. Only the CGI script needs to be parsed every time; all modules
could load off bytecode files.

Which suggests that Keir Mierle doesn't use bytecode files; I think he
should.

Regards,
Martin
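
The difference between splitting the entire input on every readline and splitting off only the first line can be sketched roughly as follows. This is a minimal illustration, not the actual codecs.StreamReader code: the function names are hypothetical, the line-break set is the one documented for str.splitlines() in current Python 3 (which may differ from the Unicode 3.2 set mentioned above), and chunked reads and a "\r\n" pair straddling two chunks are ignored.

```python
# Characters that str.splitlines() treats as line boundaries in Python 3.
# Assumption: this set stands in for "what the line break characters are".
LINE_BREAKS = frozenset("\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029")


def readline_naive(buffer):
    """Split the whole buffer, keep the first line, rejoin the rest."""
    lines = buffer.splitlines(True)
    if not lines:
        return "", ""
    return lines[0], "".join(lines[1:])


def readline_first_only(buffer):
    """Scan only up to the first line break; the rest is left untouched."""
    for i, ch in enumerate(buffer):
        if ch in LINE_BREAKS:
            end = i + 1
            # Treat "\r\n" as a single line break.
            if ch == "\r" and buffer[end:end + 1] == "\n":
                end += 1
            return buffer[:end], buffer[end:]
    return buffer, ""


def drain(text, readline):
    """Call readline repeatedly until the buffer is empty."""
    lines = []
    rest = text
    while rest:
        line, rest = readline(rest)
        lines.append(line)
    return lines


def chars_scanned_naive(text):
    """Characters examined when every call rescans the remaining buffer."""
    scanned = 0
    rest = text
    while rest:
        scanned += len(rest)  # splitlines() walks the whole remainder
        _, rest = readline_naive(rest)
    return scanned


if __name__ == "__main__":
    sample = "x\n" * 100
    print("naive scan:     ", chars_scanned_naive(sample))
    print("first-line scan:", len(sample))
```

Both helpers return the same lines, but the naive one rescans the remaining buffer on each call, so the work grows with the square of the number of lines, whereas the first-line variant examines each character once.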


More information about the Python-Dev mailing list
