This PEP proposes a new string representation form for Python 3000. In Python prior to Python 3000, the repr() built-in function converted arbitrary objects to printable ASCII strings for debugging and logging. For Python 3000, a wider range of characters, based on the Unicode standard, should be considered ‘printable’.
The current repr() converts 8-bit strings to ASCII using the following algorithm.
For Unicode strings, the following additional conversions are done.
This algorithm converts any string to printable ASCII, and repr() is used as a handy and safe way to print strings for debugging or for logging. Although all non-ASCII characters are escaped, this does not matter when most of the string’s characters are ASCII. But for other languages, such as Japanese where most characters in a string are not ASCII, this is very inconvenient.
We can use print(aJapaneseString) to get a readable string, but we don’t have a similar workaround for printing strings from collections such as lists or tuples. print(listOfJapaneseStrings) uses repr() to build the string to be printed, so the resulting strings are always hex-escaped. Or when open(japaneseFilename) raises an exception, the error message is something like IOError: [Errno 2] No such file or directory: '\u65e5\u672c\u8a9e', which isn’t helpful.
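The difference can be seen in a Python 3 interpreter that implements this proposal, where ascii() reproduces the escaped form described above. This is a minimal sketch; the variable names and sample strings are illustrative and do not come from the PEP.

```python
# Illustrative data; these names and strings are assumptions for the example.
japanese_strings = ["日本語", "東京"]

# Escaped form: every non-ASCII character becomes a \uXXXX escape,
# which is what an ASCII-only repr() produces for each list element.
escaped = ascii(japanese_strings)

# Readable form: a Unicode-aware repr() keeps printable characters as-is.
readable = repr(japanese_strings)

print(escaped)
print(readable)
```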
Python 3000 has a lot of nice features for non-Latin users such as non-ASCII identifiers, so it would be helpful if Python could also progress in a similar way for printable output.
Some users might be concerned that such output will mess up their console if they print binary data like images. But this is unlikely to happen in practice because bytes and strings are different types in Python 3000, so printing an image to the console won’t mess it up.
This issue was once discussed by Hye-Shik Chang [1], but was rejected.
- int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch). This function returns 0 if repr() should escape the Unicode character ch; otherwise it returns 1. Characters that should be escaped are defined in the Unicode character database.
- PyObject *PyObject_ASCII(PyObject *o). This function converts any Python object to a string using PyObject_Repr() and then hex-escapes all non-ASCII characters. PyObject_ASCII() generates the same string as PyObject_Repr() in Python 2.
- ascii(). This function converts any Python object to a string using repr() and then hex-escapes all non-ASCII characters. ascii() generates the same string as repr() in Python 2.
- '%a' string format operator. '%a' converts any Python object to a string using repr() and then hex-escapes all non-ASCII characters. The '%a' format operator generates the same string as '%r' in Python 2. Also, add '!a' conversion flags to the string.format() method and add the '%A' operator to PyUnicode_FromFormat(). They convert any object to an ASCII string in the same way as the '%a' string format operator.
- isprintable() method of the string type. str.isprintable() returns False if repr() would escape any character in the string; otherwise it returns True. The isprintable() method calls the Py_UNICODE_ISPRINTABLE() function internally.
The repr() in Python 3000 should be Unicode, not ASCII based, just like Python 3000 strings. Also, conversion should not be affected by the locale setting, because the locale is not necessarily the same as the output device’s locale. For example, it is common for a daemon process to be invoked in an ASCII setting, but to write UTF-8 to its log files. Also, web applications might want to report the error information in more readable form based on the HTML page’s encoding.
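The Python-level parts of this specification can be tried out in any Python 3 interpreter. The following sketch shows ascii(), the '%a' operator, the '!a' conversion flag and str.isprintable() side by side; the sample strings are assumptions chosen for illustration.

```python
text = "日本語"  # illustrative sample, not from the PEP

# ascii() hex-escapes all non-ASCII characters in the repr() string.
via_ascii = ascii(text)

# '%a' and '!a' apply the same conversion inside format strings.
via_percent = "%a" % text
via_format = "{!a}".format(text)

# repr() itself keeps printable non-ASCII characters unescaped.
via_repr = repr(text)

# isprintable() reports whether repr() would leave the string unescaped.
japanese_is_printable = text.isprintable()    # no escaping needed
tab_is_printable = "a\tb".isprintable()       # TAB is escaped by repr()

print(via_ascii, via_percent, via_format, via_repr)
print(japanese_is_printable, tab_is_printable)
```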
Characters not supported by the user’s console could be hex-escaped on printing, by the Unicode encoder’s error-handler. If the error-handler of the output file is ‘backslashreplace’, such characters are hex-escaped without raising UnicodeEncodeError. For example, if the default encoding is ASCII, print('Hello ¢') will print ‘Hello \xa2’. If the encoding is ISO-8859-1, ‘Hello ¢’ will be printed.
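The effect of the ‘backslashreplace’ error-handler can be checked directly with str.encode(), without touching the console. This sketch encodes the same message with the two encodings mentioned above; the variable names are illustrative.

```python
message = "Hello ¢"

# ASCII cannot represent the cent sign, so the handler emits a hex escape.
ascii_bytes = message.encode("ascii", errors="backslashreplace")

# ISO-8859-1 can represent it, so the handler is never invoked.
latin1_bytes = message.encode("iso-8859-1", errors="backslashreplace")

print(ascii_bytes)   # the escaped form, as ASCII bytes
print(latin1_bytes)  # a single byte 0xa2 for the cent sign
```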
The default error-handler for sys.stdout is ‘strict’. Other applications reading the output might not understand hex-escaped characters, so unsupported characters should be trapped when writing. If unsupported characters must be escaped, the error-handler should be changed explicitly. Unlike sys.stdout, sys.stderr doesn’t raise UnicodeEncodeError by default, because its default error-handler is ‘backslashreplace’. So printing error messages containing non-ASCII characters to sys.stderr will not raise an exception. Also, information about uncaught exceptions (exception object, traceback) is printed by the interpreter without raising exceptions.
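The contrast between the two error-handlers can be sketched with an in-memory stream standing in for the console. The helper write_with_handler() is hypothetical and exists only for this illustration; it wraps a BytesIO in an ASCII-encoded io.TextIOWrapper.

```python
import io


def write_with_handler(text, errors):
    """Write text to an ASCII-encoded in-memory stream; return the bytes written.

    Hypothetical helper for illustration only.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


# 'strict' (the sys.stdout default described above) traps unsupported characters.
try:
    write_with_handler("Hello ¢", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'backslashreplace' (the sys.stderr default described above) escapes them.
escaped = write_with_handler("Hello ¢", "backslashreplace")

print(strict_failed, escaped)
```

In Python 3.7 and later, sys.stdout.reconfigure(errors="backslashreplace") is one way to change the error-handler of the existing stream explicitly.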
To help debugging in non-Latin languages without changing repr(), other suggestions were made.
Strings to be printed for debugging are not only contained by lists or dicts, but also in many other types of object. File objects contain a file name in Unicode, exception objects contain a message in Unicode, etc. These strings should be printed in readable form when repr()ed. It is unlikely to be possible to implement a tool to print all possible object types.
For interactive sessions, we can write hooks to restore hex-escaped characters to the original characters. But these hooks are called only when printing the result of evaluating an expression entered in an interactive Python session, and don’t work for the print() function, for non-interactive sessions or for logging.debug("%r", ...), etc.
It is difficult to implement a subclass to restore hex-escaped characters since there isn’t enough information left by the time it’s a string to undo the escaping correctly in all cases. For example, print("\\"+"u0041") should be printed as ‘\u0041’, not ‘A’. But there is no chance to tell file objects apart.
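The ambiguity can be made concrete with a naive unescaping step of the kind such a subclass would need. The function naive_unescape() below is a hypothetical stand-in, not part of any proposal; it uses the standard 'unicode_escape' codec.

```python
def naive_unescape(text):
    """Undo hex escapes in ASCII text (hypothetical helper for illustration)."""
    return text.encode("ascii").decode("unicode_escape")


# Six literal characters: backslash, 'u', '0', '0', '4', '1'.
literal_text = "\\" + "u0041"

# A real non-ASCII character after ASCII escaping looks exactly the same way.
escaped_kana = "あ".encode("ascii", "backslashreplace").decode("ascii")

# The helper restores the escaped character correctly...
restored_kana = naive_unescape(escaped_kana)

# ...but it also rewrites text that was never an escape in the first place.
damaged_text = naive_unescape(literal_text)

print(restored_kana, damaged_text)
```

Once both inputs are plain strings, nothing distinguishes the genuine escape from the literal backslash sequence, so the second result is wrong.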
With adjustable repr(), the result of using repr() is unpredictable and would make it impossible to write correct code involving repr(). And if the current repr() is the default, then the old convention remains intact and users may expect ASCII strings as the result of repr(). Third party applications or libraries could be confused when a custom repr() function is used.
Changing repr() may break some existing code, especially testing code. Five of Python’s regression tests fail with this modification. If you need repr() strings without non-ASCII characters, as in Python 2, you can use the following function.
    def repr_ascii(obj):
        return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")
For logging or for debugging, the following code can raise UnicodeEncodeError.
    log = open("logfile", "w")
    log.write(repr(data))     # UnicodeEncodeError will be raised
                              # if data contains unsupported characters.
To avoid exceptions being raised, you can explicitly specify the error-handler.
    log = open("logfile", "w", errors="backslashreplace")
    log.write(repr(data))  # Unsupported characters will be escaped.
For a console that uses a Unicode-based encoding, for example, en_US.utf8 or de_DE.utf8, the backslashreplace trick doesn’t work, and no printable characters are escaped. This causes a problem with similar-looking characters in Western, Greek and Cyrillic languages. These languages use similar (but different) alphabets (descended from a common ancestor) and contain letters that look similar but have different character codes. For example, it is hard to distinguish Latin ‘a’, ‘e’ and ‘o’ from Cyrillic ‘а’, ‘е’ and ‘о’. (The visual representation, of course, very much depends on the fonts used, but usually these letters are almost indistinguishable.) To avoid the problem, the user can adjust the terminal encoding to get a result suitable for their environment.
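When look-alike letters are suspected, ascii() or the unicodedata module can tell them apart regardless of the terminal encoding. This sketch compares the Latin and Cyrillic letters mentioned above; the variable names are illustrative.

```python
import unicodedata

latin = "aeo"
cyrillic = "\u0430\u0435\u043e"  # Cyrillic letters that resemble Latin a, e, o

# The strings look alike on most terminals but compare unequal.
looks_same_but_differs = latin != cyrillic

# ascii() exposes the code points of the Cyrillic letters.
cyrillic_escaped = ascii(cyrillic)

# The Unicode character names make the difference explicit.
latin_names = [unicodedata.name(ch) for ch in latin]
cyrillic_names = [unicodedata.name(ch) for ch in cyrillic]

print(ascii(latin), cyrillic_escaped)
print(latin_names)
print(cyrillic_names)
```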
Complicated to implement, and in general, this is not seen as a good idea. [2]
repr('\u03b1') can be converted to "\N{GREEK SMALL LETTER ALPHA}". Using character names can be very verbose compared to hex-escape. For example, repr("\ufbf9") is converted to "\N{ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}".
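The verbosity of name-based escapes can be measured with the 'namereplace' error-handler that Python 3.5 and later provide for str.encode(). That handler is not part of this PEP and is used here only as a convenient way to produce \N{...} escapes for comparison.

```python
def escape(text, handler):
    """Escape non-ASCII characters using the given codec error-handler."""
    return text.encode("ascii", errors=handler).decode("ascii")


# Greek small letter alpha: name-based versus hex-based escape.
alpha_named = escape("\u03b1", "namereplace")
alpha_hex = escape("\u03b1", "backslashreplace")

# An Arabic ligature with a very long character name.
ligature_named = escape("\ufbf9", "namereplace")
ligature_hex = escape("\ufbf9", "backslashreplace")

print(alpha_named, alpha_hex)
print(len(ligature_named), len(ligature_hex))
```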
Stuff written to stdout might be consumed by another program that might misinterpret the \ escapes. For interactive sessions, it is possible to make the ‘backslashreplace’ error-handler the default, but this may add confusion of the kind “it works in interactive mode but not when redirecting to a file”.
The author wrote a patch in http://bugs.python.org/issue2630; this was committed to the Python 3.0 branch in revision 64138 on 06-11-2008.
This document has been placed in the public domain.
Source: https://github.com/python/peps/blob/main/peps/pep-3138.rst
Last modified: 2025-02-01 08:59:27 GMT