[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

Walter Dörwald walter at livinglogic.de
Wed Aug 24 22:14:32 CEST 2005


On 24.08.2005 at 21:15, Martin v. Löwis wrote:

> Walter Dörwald wrote:
>
>>> Right. Not sure what people think whether this should still be
>>> supported, but I keep supporting it whenever I think of it.
>>
>> OK, so should we add this for 2.4.2 or only for 2.5?
>
> You mean, string.unicodelinebreaks?

Yes.

> I think something needs to be
> done to fix the performance problem. In doing so, API changes
> might occur. We should not add API changes in 2.4.2 unless they
> contribute to the bug fix, and even then, the release manager
> probably needs to approve them (in any case, they certainly
> need to be backwards compatible)

OK. Your version of the patch (without replacing
line = line.splitlines(False)[0] with something better) might be
enough for 2.4.2.

>> Should this really be put into string.py, or should it be a class
>> attribute of unicode? (At least that's what was proposed for the
>> other strings in string.py (string.whitespace etc.) too.)
>
> If the 2.4.2 fix is based on this kind of data, I think it should go
> into a private attribute of codecs.py.

I think codecs.unicodelinebreaks has one big problem: it will not
work for codecs that do str->str decoding.

> For 2.5, I would put it
> into strings for tradition. There is no point in having some of these
> constants in strings and others as class attributes (unless we also
> add them as class attributes in 2.5, in which case adding
> unicodelinebreaks into strings would be pointless).
>
> So I think in 2.5, I would like to see
>
> # string.py
> ascii_letters = str.ascii_letters
>
> in which case unicode.linebreaks would be the right spelling.

And it would have the advantage that it could work with both str and
unicode if we had both str.linebreaks and unicode.linebreaks.

>>> I'm not so sure anymore. It is good for consistency, but I doubt
>>> there are actual use cases: how often do you want only the first
>>> n lines of some string? Reading the first n lines of a file might
>>> be an application, but then, you would rather use .readline()
>>> directly.
>>
>> Not every unicode string is read from a StreamReader.
>
> Sure: but how often do you want to fetch the first line of a Unicode
> string you happen to have in memory, without iterating over all lines
> eventually?

I don't know. The only obvious spot in the standard library (apart
from codecs.py) seems to be

    def shortdescription(self): return self.description().splitlines()[0]

in Lib/plat-mac/pimp.py

>> Another solution would be to have a unicode.itersplitlines() and
>> store the iterator. Then we wouldn't need a maxsplit because you
>> simply can stop iterating once you have what you want.
>
> That might work. I would then ask for itersplitlines to return pairs
> of (line, truncated) so you can easily know whether you merely ran
> into the end of the string, or whether you got a complete line
> (although it might be a bit too specific for the readlines() case)

Or maybe (line, terminatorlength), which gives you the same info
(terminatorlength == 0 means truncated) and makes it easy to strip
the terminator.

>> So reverting to the 2.3 behaviour for simple codecs is out?
>
> I'm -1, at least. It would also fix the problem at hand, for the
> reported case. However, it does leave some codecs in the cold, most
> notably UTF-8 (which, in turn, isn't an issue for PEP 262, since
> UTF-8 is built-in in the parser).

You meant PEP 263, right?

> I think the UTF-8 stream reader should support
> all Unicode line breaks, so it should continue to use the Python
> approach.

OK.

> However, UTF-8 is fairly common, so that reading a
> UTF-8-encoded file line-by-line shouldn't suck.

OK, so what's missing is a solution for str->str codecs (or we keep
line = line.splitlines(False)[0] and test whether this is fast
enough).

Bye,
    Walter Dörwald
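
P.S. To make the itersplitlines() idea from the discussion above a bit more concrete, here is a minimal sketch of an iterator that yields (line, terminatorlength) pairs. It is only an illustration under stated assumptions: it is written against Python 3's str rather than the 2.4 unicode type, the names LINEBREAKS, itersplitlines and firstline are hypothetical (no such method exists on str or unicode), and the character set is the one str.splitlines() treats as line boundaries in Python 3. A pure-Python scan like this would not fix the performance problem by itself; it just shows the proposed interface.

```python
# Hypothetical constant: the characters str.splitlines() treats as line
# boundaries in Python 3 ("\r\n" is handled separately below).
LINEBREAKS = frozenset("\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029")


def itersplitlines(text):
    """Lazily yield (line, terminatorlength) pairs (hypothetical helper).

    Each line keeps its terminator. terminatorlength == 0 means the
    string ended before a line break was seen (the "truncated" case).
    """
    start = 0
    pos = 0
    end = len(text)
    while pos < end:
        if text[pos] in LINEBREAKS:
            # Treat "\r\n" as one two-character terminator.
            if text[pos] == "\r" and pos + 1 < end and text[pos + 1] == "\n":
                termlen = 2
            else:
                termlen = 1
            pos += termlen
            yield text[start:pos], termlen
            start = pos
        else:
            pos += 1
    if start < end:
        # Trailing data without a line break.
        yield text[start:], 0


def firstline(text):
    """Return the first line without its terminator (hypothetical helper).

    Stops after the first pair, so the rest of the string is not scanned.
    """
    for line, termlen in itersplitlines(text):
        return line[:len(line) - termlen]
    return ""
```

With this shape the caller can stop iterating as soon as it has what it needs, so no maxsplit argument is required, and stripping the terminator is a plain slice.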

