[Python-Dev] textwrap and unicode
Greg Ward gward@python.net
Tue, 22 Oct 2002 16:24:08 -0400
On 22 October 2002, Martin v. Loewis said:
> I don't know how precisely you want to formulate the property. If x is
> a Unicode letter, then x.isspace() tells you whether it is a space
> character (this property holds for all characters of the Zs category,
> and all characters that have a bidirectionality of WS, B, or S).

OK, then it's an implementation problem rather than a "you can't get
there from here" problem. Good. The reason I need a list of
"whitespace chars" is to convert all whitespace to spaces; I use
string.maketrans() and s.translate() to do this efficiently:

class TextWrapper:
    [...]
    whitespace_trans = string.maketrans(string.whitespace,
                                        ' ' * len(string.whitespace))

    [...]
    def _munge_whitespace(self, text):
        """_munge_whitespace(text : string) -> string

        Munge whitespace in text: expand tabs and convert all other
        whitespace characters to spaces.  Eg. " foo\tbar\n\nbaz"
        becomes " foo    bar  baz".
        """
        if self.expand_tabs:
            text = text.expandtabs()
        if self.replace_whitespace:
            text = text.translate(self.whitespace_trans)
        return text

(The rationale: having tabs and newlines in a paragraph about to be
wrapped doesn't make any sense to me.)

Ahh, OK, I'm starting to see the problem: there's nothing wrong with the
translate() method of strings or unicode strings, but string.maketrans()
doesn't generate a mapping that u''.translate() likes. Hmmmm.

Right, now I've RTFD'd (read the fine docstring) for u''.translate().
Here's what I've got now:

    whitespace_trans = string.maketrans(string.whitespace,
                                        ' ' * len(string.whitespace))

    unicode_whitespace_trans = {}
    for c in string.whitespace:
        unicode_whitespace_trans[ord(unicode(c))] = ord(u' ')

    [...]
    def _munge_whitespace (self, text):
        [...]
        if self.replace_whitespace:
            if isinstance(text, str):
                text = text.translate(self.whitespace_trans)
            elif isinstance(text, unicode):
                text = text.translate(self.unicode_whitespace_trans)

That's ugly as hell, but it works.
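For readers on Python 3, where every str is Unicode and the str/unicode split above no longer exists, the dict-based table can be sketched roughly as follows. This is only an illustration, not what textwrap.py does: the function name munge_whitespace and both table names are made up here, and the wider table built from str.isspace() is an assumption prompted by Martin's remark rather than anything textwrap.py uses.

```python
import string
import sys

# Table for str.translate(): keys are code points, values are replacements.
# ASCII-only variant, mirroring the loop over string.whitespace above.
ascii_whitespace_trans = {ord(c): ord(' ') for c in string.whitespace}

# Wider variant (an assumption, not taken from textwrap.py): every code
# point that str.isspace() accepts, found by scanning the Unicode range.
unicode_whitespace_trans = {
    cp: ord(' ')
    for cp in range(sys.maxunicode + 1)
    if chr(cp).isspace()
}


def munge_whitespace(text, table=ascii_whitespace_trans, expand_tabs=True):
    """Expand tabs, then replace every character found in `table` by a space."""
    if expand_tabs:
        text = text.expandtabs()
    return text.translate(table)
```

With the default table, only the six ASCII whitespace characters are replaced; passing table=unicode_whitespace_trans also catches characters such as U+2003 (EM SPACE), at the cost of a one-time scan when the module is imported.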
Is there a cleaner way?

The other bit of ASCII/English prejudice hardcoded into textwrap.py is
this regex:

    sentence_end_re = re.compile(r'[%s]'              # lowercase letter
                                 r'[\.\!\?]'          # sentence-ending punct.
                                 r'[\"\']?'           # optional end-of-quote
                                 % string.lowercase)

You may recall this from the kerfuffle over whether there should be two
spaces after a sentence in fixed-width fonts. The feature is there, and
off by default, in TextWrapper. I'm not so concerned about this -- I
mean, this doesn't even work with German or French, never mind Hebrew or
Chinese or Hindi. Apart from the narrow definition of "lowercase
letter", it has English punctuation conventions hardcoded into it. But
still, it seems *awfully* dumb in this day and age to hardcode
string.lowercase into a regex that's meant to detect "lowercase
letters". But I couldn't find a better way to do it when I wrote this
code last spring. Is there one?

Thanks!

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
OUR PLAN HAS FAILED STOP JOHN DENVER IS NOT TRULY DEAD STOP
HE LIVES ON IN HIS MUSIC STOP PLEASE ADVISE FULL STOP
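One possible way around hardcoding string.lowercase in the sentence-end regex, sketched for Python 3 only: let the regex match any word character and leave the lowercase test to str.islower(), which already knows about non-ASCII letters. The helper name ends_sentence and the \Z anchor are assumptions made for this illustration, and the English punctuation conventions stay hardcoded exactly as the message complains.

```python
import re

# A word character, then sentence-ending punctuation, then an optional
# closing quote, anchored at the end of the chunk (the anchor is an
# assumption of this sketch). In Python 3, \w is Unicode-aware for str
# patterns, so the "is it lowercase?" question is deferred to str.islower().
_sentence_end_re = re.compile(r'(\w)[.!?]["\']?\Z')


def ends_sentence(chunk):
    """Return True if `chunk` ends like a sentence: lowercase letter + . ! or ?"""
    m = _sentence_end_re.search(chunk)
    return bool(m) and m.group(1).islower()
```

Under this sketch "été." and "grüß?" count as sentence ends while "U.S.A." does not, because the character before the final period is uppercase.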