[Python-Dev] default encoding for 8-bit string literals (was Unicode and comparisons)
Guido van Rossum <guido@python.org>
Wed, 05 Apr 2000 18:37:24 -0400
[MAL]
> > > u"..." currently interprets the characters it finds as Latin-1
> > > (this is by design, since the first 256 Unicode ordinals map to
> > > the Latin-1 characters).

[GvR]
> > Nice, except that now we seem to be ambiguous about the source
> > character encoding: it's Latin-1 for Unicode strings and UTF-8 for
> > 8-bit strings...!

[MAL]
> No... there is no definition for non-ASCII 8-bit strings in
> Python source code using the ordinal range 127-255. If you were
> to define Latin-1 as the source code encoding, then we would have
> to change auto-coercion to make a Latin-1 assumption instead, but...
> I see the picture: people are getting pretty confused about what
> is going on.
>
> If you write u"xyz" then the ordinals of those characters are
> taken and stored directly as Unicode characters. If you live
> in a Latin-1 world, then you happen to be lucky: the Unicode
> characters match your input. If not, some totally different
> characters are likely to show up if the string were written
> to a file and displayed using a Unicode-aware editor.
>
> The same will happen to your normal 8-bit string literals.
> Nothing unusual so far... if you use Latin-1 strings and
> write them to a file, you get Latin-1. If you happen to
> program on DOS, you'll get the DOS ANSI encoding for the
> German umlauts.
>
> Now the key point where all this started was that
> u'ä' in 'äöü' will raise an error due to 'äöü' being
> *interpreted* as UTF-8 -- this doesn't mean that 'äöü'
> will be interpreted as UTF-8 elsewhere in your application.
>
> The UTF-8 assumption had to be made in order to get the two
> worlds to interoperate. We could have just as well chosen
> Latin-1, but then people currently using, say, a Russian
> encoding would get upset for the same reason.
>
> One way or another somebody is not going to like whatever
> we choose, I'm afraid...
> The simplest solution is to use
> Unicode for all strings which contain non-ASCII characters
> and then call .encode() as necessary.

I have a different view on this (except that I agree that it's pretty
confusing :-). In my definition of a "source character encoding",
string literals, whether Unicode or 8-bit strings, are translated from
the source encoding to the corresponding run-time values. If I had a
C compiler that read its source in EBCDIC but cross-compiled to a
machine that used ASCII, I would expect that 'a' in the source would
have the integer value 97 (ASCII 'a'), regardless of the EBCDIC value
for 'a'.

If I type a non-ASCII Latin-1 character in a Unicode literal, it
generates the corresponding Unicode character. This means to me that
the source character encoding is Latin-1. But when I type the same
character in an 8-bit string literal, that literal is interpreted
as UTF-8 (e.g. when converting to Unicode using the default
conversions).

Thus, even though you can do whatever you want with 8-bit literals in
your program, the most defensible view is that they are UTF-8
encoded.

I would be much happier if all source code was encoded in the same
encoding, because otherwise there's no good way to view such code in a
general Unicode-aware text viewer!

My preference would be to always use UTF-8. This would mean no change
for 8-bit literals, but a big change for Unicode literals... and a
break with everyone who's currently typing Latin-1 source code and
using strings as Latin-1. (Or Latin-7, or whatever.)

My next preference would be a pragma to define the source encoding,
but that's a 1.7 issue. Maybe the whole thing is... :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)
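[Editor's note: the mismatch discussed above can be sketched in modern
Python 3, where the 1.6-era distinction between 8-bit and Unicode
strings has become bytes vs. str. The example below is illustrative
only, not the code under discussion; it shows why the same characters
encoded as Latin-1 and as UTF-8 produce incompatible byte sequences,
which is the spirit of the u'ä' in 'äöü' surprise.]

```python
# Illustrative sketch (modern Python 3): the same text, two encodings.
text = "äöü"

latin1 = text.encode("latin-1")   # b'\xe4\xf6\xfc' -- one byte per character
utf8 = text.encode("utf-8")       # b'\xc3\xa4\xc3\xb6\xc3\xbc' -- two bytes each

# Same characters, two different byte sequences:
assert latin1 != utf8

# Searching for a Latin-1-encoded 'ä' inside UTF-8 bytes fails,
# mirroring the cross-encoding comparison problem in the thread:
assert "ä".encode("latin-1") not in utf8
assert "ä".encode("utf-8") in utf8

# Decoding with the wrong assumption need not raise an error; when the
# bytes happen to be valid in the wrong encoding you get mojibake instead:
print(utf8.decode("latin-1"))     # prints 'Ã¤Ã¶Ã¼'
```

This is also why the thread's "use Unicode internally, call .encode()
at the boundary" advice holds up: as long as text stays in one
canonical form inside the program, the encoding choice is confined to
I/O.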