[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

Walter Dörwald  walter at livinglogic.de
Wed Aug 24 18:59:11 CEST 2005


Martin v. Löwis wrote:
> Walter Dörwald wrote:
>>> Martin v. Löwis wrote:
>>>> Walter Dörwald wrote:
>>>>> I think a maxsplit argument (just as for unicode.split()) would
>>>>> help too.
>>>> Correct - that would allow to get rid of the quadratic part.
>>> OK, such a patch should be rather simple. I'll give it a try.
>
> Actually, on a second thought - it would not remove the quadratic
> aspect.

At least it would remove the quadratic number of calls to
_PyUnicodeUCS2_IsLinebreak(). For each character it would be called
only once.

> You would still copy the rest string completely on each
> split. So on the first split, you copy N lines (one result line,
> and N-1 lines into the rest string), on the second split, N-2
> lines, and so on, totalling N*N/2 line copies again.

OK, that's true. We could prevent string copying if we kept the
unsplit string and the position of the current line terminator, but
this would require a "first position after a line terminator" method.

> The only
> thing you save is the join (as the rest is already joined), and
> the IsLineBreak calls (which are necessary only for the first
> line).
>
> Please see python.org/sf/1268314;

The last part of the patch seems to be more related to bug #1235646.

With the patch, test_pep263 and test_codecs fail (and test_parser, but
this might be unrelated):

python Lib/test/test_pep263.py gives the following output:

  File "Lib/test/test_pep263.py", line 22
SyntaxError: list index out of range

test_codecs.py has the following two complaints:

  File "/var/home/walter/Achtung/Python-linecache/dist/src/Lib/codecs.py", line 366, in readline
    self.charbuffer = lines[1] + self.charbuffer
IndexError: list index out of range

and

  File "/var/home/walter/Achtung/Python-linecache/dist/src/Lib/codecs.py", line 336, in readline
    line = result.splitlines(False)[0]
NameError: global name 'result' is not defined

> it solves the problem by
> keeping the splitlines result. It only invokes IsLineBreak
> once per character, and also copies each character only once,
> and allocates each line only once, totalling in O(N) for
> these operations. It still does contain a quadratic operation:
> the lines are stored in a list, and the first line is
> removed from the list with del lines[0]. This copies N-1
> pointers, resulting in N*N/2 pointer copies. That should still
> be much faster than the current code.

Using collections.deque() should get rid of this problem.

>> unicodelinebreaks = u"".join(unichr(c) for c in xrange(0,
>> sys.maxunicode) if unichr(c).islinebreak())
>
> That is very inefficient. I would rather add a static list
> to the string module, and have a test that says
>
> assert str.unicodelinebreaks == u"".join(ch for ch in (unichr(c) for c
> in xrange(0, sys.maxunicode)) if unicodedata.bidirectional(ch)=='B' or
> unicodedata.category(ch)=='Zl')

You mean, in the test suite?

> unicodelinebreaks could then be defined as
>
> # u"\r\n\x1c\x1d\x1e\x85\u2028\u2029"
> '\n\r\x1c\x1d\x1e\xc2\x85\xe2\x80\xa8\xe2\x80\xa9'.decode("utf-8")

That might be better, as this definition won't change very often. BTW,
why the decode() call? For a Python without unicode?

>> OK, this would mean we'd have to distinguish between a direct call to
>> read() and one done by readline() (which we do anyway through the
>> firstline argument).
>
> See my patch. If we have cached lines, we don't need to call .read
> at all.

I wonder what happens if calls to read() and readline() are mixed
(e.g. if I'm reading Fortran source or anything with a fixed line
header): read() would be used to read the first n characters (which
joins the line buffer) and readline() reads the rest (which would
split it again), etc. (Of course this could be done via a single
readline() call.)

But I think a maxsplit argument for splitlines() would make sense
independent of this problem.

Bye,
   Walter Dörwald
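[Archive note: a minimal modern Python 3 sketch of the line-buffering
idea discussed in this thread. The class and names below are
illustrative, not the actual python.org/sf/1268314 patch: it keeps the
splitlines() result and hands out one line per readline() call, using
collections.deque so that removing the first line is O(1) instead of
the O(N) pointer shift that `del lines[0]` on a plain list incurs.]

```python
# Illustrative reconstruction of the buffering strategy, not the real
# codecs.py patch. splitlines(True) keeps the line terminators and
# splits on Unicode line breaks (including U+2028 LINE SEPARATOR),
# so IsLineBreak-style scanning happens once per character, up front.
from collections import deque


class LineBuffer:
    def __init__(self, text):
        # One O(N) pass; each line is allocated exactly once.
        self.lines = deque(text.splitlines(True))

    def readline(self):
        # popleft() is O(1). With a list, "del lines[0]" would shift
        # the remaining N-1 pointers on every call (N*N/2 in total),
        # which is the quadratic step described above.
        return self.lines.popleft() if self.lines else ""


buf = LineBuffer("one\ntwo\u2028three")
print(buf.readline())  # "one\n"
print(buf.readline())  # "two\u2028"
print(buf.readline())  # "three"
```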


More information about the Python-Dev mailing list
