
[Python-Dev] Encoding of 8-bit strings and Python source code

Fredrik Lundh <effbot@telia.com>
Tue, 25 Apr 2000 17:16:25 +0200


I'll follow up with a longer reply later; just one correction:

M.-A. Lemburg <mal@lemburg.com> wrote:

> Ad 1. UTF-8 is used as basis in many other languages such
> as TCL or Perl.  It is not an intuitive way of
> writing strings and causes problems due to one character
> spanning 1-6 bytes. Still, the world seems to be moving
> into this direction, so going the same way can't be all
> wrong...

the problem here is the current Python implementation
doesn't use UTF-8 in the same way as Perl and Tcl.  Perl
and Tcl only expose one string type, and that type
behaves exactly like it should:

    "The Tcl string functions properly handle multi-byte
    UTF-8 characters as single characters."

    "By default, Perl now thinks in terms of Unicode
    characters instead of simple bytes. /.../ All the
    relevant built-in functions (length, reverse, and
    so on) now work on a character-by-character
    basis instead of byte-by-byte, and strings are
    represented internally in Unicode."

or in other words, both languages guarantee that given a
string s:

    - s is a sequence of characters (not bytes)
    - len(s) is the number of characters in the string
    - s[i] is the i'th character
    - len(s[i]) is 1

and as I've pointed out a zillion times, Python 1.6a2 doesn't.  this
should be solved, and I see (at least) four ways to do that:

-- the Tcl 8.1 way: make 8-bit strings UTF-8 aware.  operations
   like len and getitem usually search from the start of the string.
   to handle binary data, introduce a special ByteArray type.  when
   mixing ByteArrays and strings, treat each byte in the array as an
   8-bit unicode character (conversions from strings to byte arrays
   are lossy).

[imho: lots of code, and seriously affects performance, even when
unicode characters are never used.  this approach was abandoned
in Tcl 8.2]

-- the Tcl 8.2 way: use a unified string type, which stores data as
   UTF-8 and/or 16-bit unicode:

        struct {
            char* bytes; /* 8-bit representation (utf-8) */
            Tcl_UniChar* unicode; /* 16-bit representation */
        }

   if one of the representations is modified, the other is
   regenerated on demand.  operations like len, slice and getitem
   always convert to 16-bit first.

   still need a ByteArray type, similar to the one described above.

[imho: faster than before, but still not as good as a pure 8-bit string
type.  and the need for a separate byte array type would break a lot
of existing Python code]

-- the Perl 5.6 way? (haven't looked at the implementation, but I'm
   pretty sure someone told me it was done this way).  essentially
   same as Tcl 8.2, but with an extra encoding field (to avoid
   conversions if data is just passed through).

        struct {
            int encoding;
            char* bytes; /* 8-bit representation */
            Tcl_UniChar* unicode; /* 16-bit representation */
        }

[imho: see Tcl 8.2]

-- my proposal: expose both types, but let them contain characters
   from the same character set -- at least when used as strings.
   as before, 8-bit strings can be used to store binary data, so we
   don't need a separate ByteArray type.  in an 8-bit string, there's
   always one character per byte.

[imho: small changes to the existing code base, about as efficient as
can be, no attempt to second-guess the user, fully backwards
compatible, fully compliant with the definition of strings in the
language reference, patches are available, etc...]

</F>
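
The four string guarantees listed in the message, and the "one character per byte" rule from the last proposal, can be illustrated with a small sketch. It is written for present-day Python 3 (not 1.6a2), where `str` holds characters and `bytes` holds raw bytes. The sample string and the helper name `widen` are made up for this example, and reading "one character per byte" as "byte value n is code point n" (which coincides with Latin-1) is an assumption, not something the message spells out.

```python
# Assumption: Python 3 semantics; the sample text and widen() are hypothetical.
text = "h\u00e9llo"              # five characters; the second is U+00E9

# A character string satisfies the guarantees from the message.
assert len(text) == 5            # len(s) counts characters
assert text[1] == "\u00e9"       # s[i] is the i'th character
assert len(text[1]) == 1         # len(s[i]) is 1

# The same text as UTF-8 bytes: U+00E9 takes two bytes, so length and
# indexing now count bytes, which is the behaviour being objected to.
utf8 = text.encode("utf-8")
assert len(utf8) == 6
assert utf8[1:2] == b"\xc3"      # only the first half of the character


def widen(data: bytes) -> str:
    """Treat each byte as one character with the same code point.

    This models "one character per byte" under the assumption that
    byte value n stands for code point n (equivalent to Latin-1).
    """
    return data.decode("latin-1")


latin1 = text.encode("latin-1")  # one byte per character for this text
assert widen(latin1) == text     # round-trips without surprises
assert len(widen(utf8)) == 6     # UTF-8 bytes are not reinterpreted
```

Under that reading, an 8-bit string that happens to hold UTF-8 data is still six characters long; turning it back into five characters would be an explicit decoding step rather than something the string type guesses at.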
