[Python-Dev] Unicode debate

Paul Prescod paul@prescod.net
Thu, 27 Apr 2000 21:20:22 -0500


Guido van Rossum wrote:
>
> ...
>
> I've heard a few people claim that strings should always be considered
> to contain "characters" and that there should be one character per
> string element.  I've also heard a clamoring that there should only be
> one string type.  You folks have never used Asian encodings.  In
> countries like Japan, China and Korea, encodings are a fact of life,
> and the most popular encodings are ASCII supersets that use a variable
> number of bytes per character, just like UTF-8.  Each country or
> language uses different encodings, even though their characters look
> mostly the same to western eyes.  UTF-8 and Unicode is having a hard
> time getting adopted in these countries because most software that
> people use deals only with the local encodings.  (Sounds familiar?)

I think that maybe an important point is getting lost here. I could be
wrong, but it seems that all of this emphasis on encodings is misplaced.

The physical and logical makeup of character strings are entirely
separate issues. Unicode is a character set. It works in the logical
domain.

Dozens of different physical encodings can be used for Unicode
characters. There are XML users who work with XML (and thus Unicode)
every day and never see UTF-8, UTF-16 or any other Unicode-consortium
"sponsored" encoding. If you invent an encoding tomorrow, it can still
be XML-compatible. There are many encodings older than Unicode that are
XML (and Unicode) compatible.

I have not heard complaints about the XML way of looking at the world
and in fact it was explicitly endorsed by many of the world's leading
experts on internationalization. I haven't followed the Java situation
as closely but I have also not heard screams about its support for i18n.

> The truth of the matter is: the encoding of string objects is in the
> mind of the programmer.  When I read a GIF file into a string object,
> the encoding is "binary goop".

IMHO, it's a mistake of history that you would even think it makes sense
to read a GIF file into a "string" object and we should be trying to
erase that mistake, as quickly as possible (which is admittedly not very
quickly) not building more and more infrastructure around it. How can we
make the transition to a "binary goops are not strings" world easiest?

> The moral of all this?  8-bit strings are not going away.

If that is a statement of your long term vision, then I think that it is
very unfortunate. Treating string literals as if they were isomorphic
with byte arrays was probably the right thing in 1991 but it won't be in
2005.

It doesn't meet the definition of string used in the Unicode spec., nor
in XML, nor in Java, nor at the W3C nor in most other up and coming
specifications.

From the W3C site:

"While ISO-2022-JP is not sufficient for every ISO10646 document, it is
the case that ISO10646 is a sufficient document character set for any
entity encoded with ISO-2022-JP."

http://www.w3.org/MarkUp/html-spec/charset-harmful.html

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html
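
As an illustration of the logical/physical distinction argued above, here is a minimal sketch in present-day Python 3, which postdates this message. The sample text, the list of codecs and all variable names are assumptions chosen for the example, not anything proposed in the thread.

```python
# One logical string: a sequence of three Unicode characters.
text = "日本語"  # hypothetical sample text

# Several physical encodings of that same logical string.
codecs = ["utf-8", "utf-16", "shift_jis", "euc_jp", "iso2022_jp"]
encoded = {name: text.encode(name) for name in codecs}

# The byte sequences differ from codec to codec...
byte_lengths = {name: len(data) for name, data in encoded.items()}

# ...yet every one of them decodes back to the same characters.
decoded = {name: data.decode(name) for name, data in encoded.items()}

# "Binary goop" such as a GIF header is kept in a bytes object,
# which is a separate type from the text string above.
gif_header = b"GIF89a"
mixing_is_an_error = False
try:
    gif_header + text  # bytes and str cannot be concatenated
except TypeError:
    mixing_is_an_error = True
```

The point of the sketch is only that the number of characters stays fixed while the byte representation varies with the codec, and that binary data lives in a type of its own.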

