Python Enhancement Proposals

PEP 3138 – String representation in Python 3000

Author: Atsuo Ishimoto <ishimoto at gembook.org>
Status: Final
Type: Standards Track
Created: 05-May-2008
Python-Version: 3.0
Post-History: 05-May-2008, 05-Jun-2008

Abstract

This PEP proposes a new string representation form for Python 3000. In Python prior to Python 3000, the repr() built-in function converted arbitrary objects to printable ASCII strings for debugging and logging. For Python 3000, a wider range of characters, based on the Unicode standard, should be considered ‘printable’.

Motivation

The current repr() converts 8-bit strings to ASCII using the following algorithm.

  • Convert CR, LF, TAB and ‘\’ to ‘\r’, ‘\n’, ‘\t’, ‘\\’.
  • Convert other non-printable characters (0x00-0x1f, 0x7f) and non-ASCII characters (>= 0x80) to ‘\xXX’.
  • Backslash-escape quote characters (apostrophe, ') and add the quote character at the beginning and the end.

For Unicode strings, the following additional conversions are done.

  • Convert leading surrogate pair characters without trailing character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to ‘\uXXXX’.
  • Convert 16-bit characters (>= 0x100) to ‘\uXXXX’.
  • Convert 21-bit characters (>= 0x10000) and surrogate pair characters to ‘\U00xxxxxx’.

This algorithm converts any string to printable ASCII, and repr() is used as a handy and safe way to print strings for debugging or for logging. Although all non-ASCII characters are escaped, this does not matter when most of the string’s characters are ASCII. But for other languages, such as Japanese where most characters in a string are not ASCII, this is very inconvenient.

We can use print(aJapaneseString) to get a readable string, but we don’t have a similar workaround for printing strings from collections such as lists or tuples. print(listOfJapaneseStrings) uses repr() to build the string to be printed, so the resulting strings are always hex-escaped. Or when open(japaneseFilename) raises an exception, the error message is something like IOError: [Errno 2] No such file or directory: '\u65e5\u672c\u8a9e', which isn’t helpful.
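As a small illustration of the point above, the following sketch assumes a Python 3 interpreter in which this PEP is implemented. The variable names are illustrative only, and ascii() is used to reproduce the hex-escaped form that the older repr() would have produced.

```python
# Containers build their string form from repr() of each element, not str().
names = ["日本語", "abc"]

as_text = str(names)         # the text print(names) would write
legacy_style = ascii(names)  # the same text with non-ASCII characters hex-escaped

print(as_text == "['日本語', 'abc']")
print(legacy_style)
```

With the escaped form, the Japanese string in the list cannot be read without decoding the escapes by hand, which is the inconvenience this PEP sets out to remove.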

Python 3000 has a lot of nice features for non-Latin users such as non-ASCII identifiers, so it would be helpful if Python could also progress in a similar way for printable output.

Some users might be concerned that such output will mess up their console if they print binary data like images. But this is unlikely to happen in practice because bytes and strings are different types in Python 3000, so printing an image to the console won’t mess it up.

This issue was once discussed by Hye-Shik Chang [1], but was rejected.

Specification

  • Add a new function to the Python C API int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch). This function returns 0 if repr() should escape the Unicode character ch; otherwise it returns 1. Characters that should be escaped are defined in the Unicode character database as:
    • Cc (Other, Control)
    • Cf (Other, Format)
    • Cs (Other, Surrogate)
    • Co (Other, Private Use)
    • Cn (Other, Not Assigned)
    • Zl (Separator, Line), refers to LINE SEPARATOR (‘\u2028’).
    • Zp (Separator, Paragraph), refers to PARAGRAPH SEPARATOR (‘\u2029’).
    • Zs (Separator, Space) other than ASCII space (‘\x20’). Characters in this category should be escaped to avoid ambiguity.
  • The algorithm to build repr() strings should be changed to:
    • Convert CR, LF, TAB and ‘\’ to ‘\r’, ‘\n’, ‘\t’, ‘\\’.
    • Convert non-printable ASCII characters (0x00-0x1f, 0x7f) to ‘\xXX’.
    • Convert leading surrogate pair characters without trailing character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to ‘\uXXXX’.
    • Convert non-printable characters (Py_UNICODE_ISPRINTABLE() returns 0) to ‘\xXX’, ‘\uXXXX’ or ‘\U00xxxxxx’.
    • Backslash-escape quote characters (apostrophe, 0x27) and add a quote character at the beginning and the end.
  • Set the Unicode error-handler for sys.stderr to ‘backslashreplace’ by default.
  • Add a new function to the Python C API PyObject *PyObject_ASCII(PyObject *o). This function converts any Python object to a string using PyObject_Repr() and then hex-escapes all non-ASCII characters. PyObject_ASCII() generates the same string as PyObject_Repr() in Python 2.
  • Add a new built-in function, ascii(). This function converts any Python object to a string using repr() and then hex-escapes all non-ASCII characters. ascii() generates the same string as repr() in Python 2.
  • Add a '%a' string format operator. '%a' converts any Python object to a string using repr() and then hex-escapes all non-ASCII characters. The '%a' format operator generates the same string as '%r' in Python 2. Also, add the '!a' conversion flag to the string.format() method and add the '%A' operator to PyUnicode_FromFormat(). They convert any object to an ASCII string in the same way as the '%a' string format operator.
  • Add an isprintable() method to the string type. str.isprintable() returns False if repr() would escape any character in the string; otherwise it returns True. The isprintable() method calls the Py_UNICODE_ISPRINTABLE() function internally.
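The category rules and the new ASCII conversions listed above can be sketched at the Python level as follows. This is only an approximation under stated assumptions: is_printable_char() and is_printable() are hypothetical helpers built on the unicodedata module, not the C implementation of Py_UNICODE_ISPRINTABLE(), and the sample strings are arbitrary.

```python
import unicodedata

# General categories that the specification lists as "should be escaped".
ESCAPED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs"}


def is_printable_char(ch):
    """Hypothetical Python-level analogue of Py_UNICODE_ISPRINTABLE()."""
    if ch == " ":
        return True  # ASCII space is the only Zs character left unescaped
    return unicodedata.category(ch) not in ESCAPED_CATEGORIES


def is_printable(text):
    """True if no character in text would need escaping under the rules above."""
    return all(is_printable_char(ch) for ch in text)


# Compare the sketch with the built-in str.isprintable() method.
samples = ["abc", "日本語", "a b", "a\u3000b", "line\u2028sep", "tab\t", ""]
for s in samples:
    print(ascii(s), is_printable(s), s.isprintable())

# ascii(), '%a' and '!a' all apply repr() and then hex-escape non-ASCII text.
word = "日本語"
print(ascii(word))
print("%a" % word)
print("{!a}".format(word))
```

For these samples the sketch and str.isprintable() are expected to agree, and the three ASCII conversions yield the same escaped string, while repr(word) keeps the Japanese characters as they are.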

Rationale

The repr() in Python 3000 should be Unicode, not ASCII based, just like Python 3000 strings. Also, conversion should not be affected by the locale setting, because the locale is not necessarily the same as the output device’s locale. For example, it is common for a daemon process to be invoked in an ASCII setting, but to write UTF-8 to its log files. Also, web applications might want to report the error information in more readable form based on the HTML page’s encoding.

Characters not supported by the user’s console could be hex-escaped on printing, by the Unicode encoder’s error-handler. If the error-handler of the output file is ‘backslashreplace’, such characters are hex-escaped without raising UnicodeEncodeError. For example, if the default encoding is ASCII, print('Hello ¢') will print ‘Hello \xa2’. If the encoding is ISO-8859-1, ‘Hello ¢’ will be printed.
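The effect of the error-handler can be reproduced without a console by encoding the text directly. The sketch below uses str.encode() as a stand-in for what an output file with the ‘backslashreplace’ handler would do; the variable names are illustrative.

```python
text = "Hello ¢"

# ASCII cannot represent U+00A2, so the handler substitutes a hex escape.
ascii_bytes = text.encode("ascii", errors="backslashreplace")

# ISO-8859-1 can represent U+00A2 directly as the single byte 0xA2,
# so the handler is never invoked.
latin1_bytes = text.encode("iso-8859-1", errors="backslashreplace")

print(ascii_bytes)   # b'Hello \\xa2'
print(latin1_bytes)  # b'Hello \xa2'
```

The first result contains the four ASCII characters of the escape sequence, while the second contains the single byte for the cent sign.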

The default error-handler for sys.stdout is ‘strict’. Other applications reading the output might not understand hex-escaped characters, so unsupported characters should be trapped when writing. If unsupported characters must be escaped, the error-handler should be changed explicitly. Unlike sys.stdout, sys.stderr doesn’t raise UnicodeEncodeError by default, because the default error-handler is ‘backslashreplace’. So printing error messages containing non-ASCII characters to sys.stderr will not raise an exception. Also, information about uncaught exceptions (exception object, traceback) is printed by the interpreter without raising exceptions.
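The difference between the two handlers can be observed with in-memory streams. The sketch below is an illustration only: write_with_handler() is a hypothetical helper, and an ASCII-encoded io.TextIOWrapper stands in for sys.stdout (‘strict’) and sys.stderr (‘backslashreplace’) so that the result does not depend on the real console.

```python
import io


def write_with_handler(text, errors):
    """Write text to an in-memory ASCII stream using the given error-handler."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


# 'strict' behaves like the sys.stdout default: unsupported characters raise.
try:
    write_with_handler("Hello ¢", "strict")
    strict_result = "no error"
except UnicodeEncodeError as exc:
    strict_result = type(exc).__name__

# 'backslashreplace' behaves like the sys.stderr default: they are escaped.
escaped = write_with_handler("Hello ¢", "backslashreplace")

print(strict_result)
print(escaped)
```

On a real interpreter, the handlers in effect can be inspected through the errors attribute of sys.stdout and sys.stderr.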

Alternate Solutions

To help debugging in non-Latin languages without changing repr(), other suggestions were made.

  • Supply a tool to print lists or dicts.

    Strings to be printed for debugging are not only contained by lists or dicts, but also in many other types of object. File objects contain a file name in Unicode, exception objects contain a message in Unicode, etc. These strings should be printed in readable form when repr()ed. It is unlikely to be possible to implement a tool to print all possible object types.

  • Use sys.displayhook and sys.excepthook.

    For interactive sessions, we can write hooks to restore hex-escaped characters to the original characters. But these hooks are called only when printing the result of evaluating an expression entered in an interactive Python session, and don’t work for the print() function, for non-interactive sessions or for logging.debug("%r", ...), etc.

  • Subclass sys.stdout and sys.stderr.

    It is difficult to implement a subclass to restore hex-escaped characters since there isn’t enough information left by the time it’s a string to undo the escaping correctly in all cases. For example, print("\\" + "u0041") should be printed as ‘\u0041’, not ‘A’. But there is no chance to tell file objects apart.

  • Make the encoding used by unicode_repr() adjustable, and make the existing repr() the default.

    With an adjustable repr(), the result of using repr() is unpredictable and would make it impossible to write correct code involving repr(). And if the current repr() is the default, then the old convention remains intact and users may expect ASCII strings as the result of repr(). Third party applications or libraries could be confused when a custom repr() function is used.
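The difficulty with the sys.stdout and sys.stderr subclassing alternative above can be made concrete. In the sketch below, naive_restore() is a hypothetical stream-side filter, written here with the unicode_escape codec, that tries to undo hex escapes; it is not something proposed in this PEP.

```python
def naive_restore(text):
    """Hypothetical filter that turns hex escapes back into characters."""
    return text.encode("ascii").decode("unicode_escape")


# The program deliberately writes six literal characters: \ u 0 0 4 1
literal = "\\" + "u0041"
restored = naive_restore(literal)
print(literal)   # \u0041
print(restored)  # A

# Text typed literally and text produced by escaping are indistinguishable
# once they reach the stream as plain strings.
typed = "\\" + "u0101"
escaped = "\u0101".encode("ascii", "backslashreplace").decode("ascii")
print(typed == escaped)  # True
```

Because the two strings in the last comparison are equal, no filter sitting on the output stream can know which of them should be restored to a character and which should be left alone.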

Backwards Compatibility

Changing repr() may break some existing code, especially testing code. Five of Python’s regression tests fail with this modification. If you need repr() strings without non-ASCII characters, as in Python 2, you can use the following function.

    def repr_ascii(obj):
        # Encode the repr() to ASCII, hex-escaping anything unsupported.
        return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")
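As a usage sketch, the function can be compared with the ascii() built-in added by this PEP. The sample dictionary is illustrative, and the function is repeated only so that the example runs on its own.

```python
def repr_ascii(obj):
    # Encode the repr() to ASCII, hex-escaping anything unsupported.
    return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")


data = {"名前": "日本語", "n": 1}

print(repr_ascii(data))
print(repr_ascii(data) == ascii(data))
```

For this input both approaches give the same ASCII-only string, so on interpreters where ascii() is available it can serve the same purpose.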

For logging or for debugging, the following code can raise UnicodeEncodeError.

    log = open("logfile", "w")
    log.write(repr(data))  # UnicodeEncodeError will be raised
                           # if data contains unsupported characters.

To avoid exceptions being raised, you can explicitly specify the error-handler.

    log = open("logfile", "w", errors="backslashreplace")
    log.write(repr(data))  # Unsupported characters will be escaped.
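A self-contained variant of the two snippets above is sketched below. It assumes an explicit encoding="ascii" so that the outcome does not depend on the platform’s default encoding, and it writes to a temporary directory; the data value and file name are illustrative.

```python
import os
import tempfile

data = ["日本語", "abc"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "logfile")

    # Default ('strict') handler: unsupported characters raise an exception.
    try:
        with open(path, "w", encoding="ascii") as log:
            log.write(repr(data))
        strict_outcome = "written"
    except UnicodeEncodeError:
        strict_outcome = "UnicodeEncodeError"

    # Explicit handler: unsupported characters are hex-escaped instead.
    with open(path, "w", encoding="ascii", errors="backslashreplace") as log:
        log.write(repr(data))
    with open(path, encoding="ascii") as log:
        logged = log.read()

print(strict_outcome)
print(logged)
```

The second write succeeds and the log file contains the escaped form of the list, much as it would have under the older repr().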

For a console that uses a Unicode-based encoding, for example, en_US.utf8 or de_DE.utf8, the backslashreplace trick doesn’t work and no printable characters are escaped. This will cause a problem with similar-looking characters in Western, Greek and Cyrillic languages. These languages use similar (but different) alphabets (descended from a common ancestor) and contain letters that look similar but have different character codes. For example, it is hard to distinguish Latin ‘a’, ‘e’ and ‘o’ from Cyrillic ‘а’, ‘е’ and ‘о’. (The visual representation, of course, very much depends on the fonts used but usually these letters are almost indistinguishable.) To avoid the problem, the user can adjust the terminal encoding to get a result suitable for their environment.
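The look-alike problem can be illustrated as follows. Using ascii() and unicodedata.name() to tell the characters apart is a suggestion made here for debugging, in addition to the terminal-encoding adjustment mentioned above; the variable names are illustrative.

```python
import unicodedata

latin = "aeo"
cyrillic = "\u0430\u0435\u043e"  # Cyrillic letters that resemble a, e, o

# The strings look alike on most consoles but are not equal.
print(latin == cyrillic)  # False

# ascii() makes the difference visible regardless of font or terminal.
print(ascii(latin))     # 'aeo'
print(ascii(cyrillic))  # '\u0430\u0435\u043e'

# Character names identify each letter unambiguously.
for ch in cyrillic:
    print(ascii(ch), unicodedata.name(ch))
```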

Rejected Proposals

  • Add encoding and errors arguments to the builtin print() function, with defaults of sys.getfilesystemencoding() and ‘backslashreplace’.

    Complicated to implement, and in general, this is not seen as a good idea. [2]

  • Use character names to escape characters, instead of hex character codes. For example, repr('\u03b1') can be converted to "\N{GREEK SMALL LETTER ALPHA}".

    Using character names can be very verbose compared to hex-escape. For example, repr("\ufbf9") is converted to "\N{ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}".

  • Default error-handler of sys.stdout should be ‘backslashreplace’.

    Stuff written to stdout might be consumed by another program that might misinterpret the \ escapes. For interactive sessions, it is possible to make the ‘backslashreplace’ error-handler the default, but this may add confusion of the kind “it works in interactive mode but not when redirecting to a file”.
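The verbosity argument against the character-name proposal above can be checked with the unicodedata module. The sketch below is illustrative; the ‘namereplace’ error-handler it uses at the end is a later addition to Python (3.5) and is not part of this PEP.

```python
import unicodedata

alpha = "\u03b1"
ligature = "\ufbf9"

# Compare the length of the hex escape with the length of a \N{...} escape.
for ch in (alpha, ligature):
    hex_form = ascii(ch)[1:-1]                    # e.g. \u03b1
    name_form = "\\N{%s}" % unicodedata.name(ch)  # e.g. \N{GREEK SMALL ...}
    print(hex_form, len(hex_form), len(name_form))

# Later Python versions offer name-based escaping as an opt-in handler.
named = alpha.encode("ascii", errors="namereplace")
print(named)
```

The hex escape is always six characters for these code points, while the name-based form grows with the length of the character name.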

Implementation

The author wrote a patch in http://bugs.python.org/issue2630; this was committed to the Python 3.0 branch in revision 64138 on 06-11-2008.

References

[1] Multibyte string on string::string_print (http://bugs.python.org/issue479898)
[2] [Python-3000] Displaying strings containing unicode escapes (https://mail.python.org/pipermail/python-3000/2008-April/013366.html)

Copyright

This document has been placed in the public domain.


Source: https://github.com/python/peps/blob/main/peps/pep-3138.rst

Last modified: 2025-02-01 08:59:27 GMT

