PEP 528 – Change Windows console encoding to UTF-8

Author: Steve Dower <steve.dower at python.org>
Status:

Abstract

Historically, Python uses the ANSI APIs for interacting with the Windows operating system, often via C Runtime functions. However, these have been long discouraged in favor of the UTF-16 APIs. Within the operating system, all text is represented as UTF-16, and the ANSI APIs perform encoding and decoding using the active code page.

This PEP proposes changing the default standard stream implementation on Windows to use the Unicode APIs. This will allow users to print and input the full range of Unicode characters at the default Windows console. This also requires a subtle change to how the tokenizer parses text from readline hooks.

Specific Changes

Add _io.WindowsConsoleIO

Currently an instance of _io.FileIO is used to wrap the file descriptors representing standard input, output and error. We add a new class (implemented in C), _io.WindowsConsoleIO, that acts as a raw IO object using the Windows console functions, specifically ReadConsoleW and WriteConsoleW.

This class will be used when the legacy-mode flag is not in effect, when opening a standard stream by file descriptor and the stream is a console buffer rather than a redirected file. Otherwise, _io.FileIO will be used as it is today.
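
The selection rule above can be sketched as a small decision function. This is only an illustration of the stated conditions: the function name and its boolean parameters are hypothetical, and the class names are returned as strings because _io.WindowsConsoleIO exists only on Windows builds.

```python
def pick_raw_stream_class(is_console: bool, legacy_mode: bool) -> str:
    """Return the name of the raw IO class the PEP describes for a standard stream.

    Hypothetical helper: the real decision is made inside the interpreter
    when a standard stream is opened by file descriptor.
    """
    if is_console and not legacy_mode:
        # Console buffer, legacy-mode flag not in effect: use the new class.
        return "_io.WindowsConsoleIO"
    # Redirected file, or legacy mode requested: behave as before.
    return "_io.FileIO"


print(pick_raw_stream_class(is_console=True, legacy_mode=False))
print(pick_raw_stream_class(is_console=False, legacy_mode=False))
```

Both conditions must hold for the new class to be chosen, so redirecting a stream to a file or a pipe keeps the existing _io.FileIO behaviour.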

This is a raw (bytes) IO class that requires text to be passed encoded with utf-8, which will be decoded to utf-16-le and passed to the Windows APIs. Similarly, bytes read from the class will be provided by the operating system as utf-16-le and converted into utf-8 when returned to Python.
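
The two transcoding directions can be modelled in pure Python. This is a minimal sketch of the conversion only, not the C implementation; both function names are hypothetical and no console API is called.

```python
def to_console_payload(data: bytes) -> bytes:
    """Write path: utf-8 bytes from Python become utf-16-le bytes for the console."""
    return data.decode("utf-8").encode("utf-16-le")


def from_console_payload(data: bytes) -> bytes:
    """Read path: utf-16-le bytes from the console become utf-8 bytes for Python."""
    return data.decode("utf-16-le").encode("utf-8")


utf8_bytes = "naïve ☃".encode("utf-8")
utf16_bytes = to_console_payload(utf8_bytes)

# The round trip is lossless for any valid text.
assert from_console_payload(utf16_bytes) == utf8_bytes
```

Because both encodings cover all of Unicode, the conversion loses nothing in either direction, which is what allows the console to accept the full character range.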

The use of an ASCII-compatible encoding is required to maintain compatibility with code that bypasses the TextIOWrapper and directly writes ASCII bytes to the standard streams (for example, Twisted’s process_stdinreader.py). Code that assumes a particular encoding for the standard streams other than ASCII will likely break.

Add _PyOS_WindowsConsoleReadline

To allow Unicode entry at the interactive prompt, a new readline hook is required. The existing PyOS_StdioReadline function will delegate to the new _PyOS_WindowsConsoleReadline function when reading from a file descriptor that is a console buffer and the legacy-mode flag is not in effect (the logic should be identical to above).

Since the readline interface is required to return an 8-bit encoded string with no embedded nulls, the _PyOS_WindowsConsoleReadline function transcodes from utf-16-le as read from the operating system into utf-8.

The function PyRun_InteractiveOneObject, which currently obtains the encoding from sys.stdin, will select utf-8 unless the legacy-mode flag is in effect. This may require readline hooks to change their encodings to utf-8, or to require legacy-mode for correct behaviour.
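
The need for transcoding can be seen by comparing the two encodings of the same line. The sketch below is a pure-Python model, not the C hook; the function name is hypothetical, and raising ValueError on a literal NUL character is an assumption made for illustration.

```python
def transcode_console_line(utf16le_line: bytes) -> bytes:
    """Convert a utf-16-le line into a utf-8 byte string with no null bytes."""
    line = utf16le_line.decode("utf-16-le").encode("utf-8")
    if b"\x00" in line:
        # Only possible if the input contained an actual U+0000 character.
        raise ValueError("line contains an embedded null character")
    return line


raw_line = "print('π')\n".encode("utf-16-le")

# Every ASCII character carries a zero high byte in utf-16-le ...
assert b"\x00" in raw_line
# ... while the utf-8 form of the same text has none.
assert b"\x00" not in transcode_console_line(raw_line)
```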

Add legacy mode

Launching Python with the environment variable PYTHONLEGACYWINDOWSSTDIO set will enable the legacy-mode flag, which completely restores the previous behaviour.
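
One way to opt a child process into legacy mode is to set the variable in the environment passed to it. This is a sketch under assumptions: the helper names are hypothetical, and the value "1" is an arbitrary non-empty string.

```python
import os
import subprocess
import sys


def legacy_stdio_env(base=None):
    """Return a copy of the environment with PYTHONLEGACYWINDOWSSTDIO set."""
    env = dict(os.environ if base is None else base)
    env["PYTHONLEGACYWINDOWSSTDIO"] = "1"  # arbitrary non-empty value
    return env


def run_python_in_legacy_mode(args):
    """Launch a child interpreter with the legacy-mode variable set."""
    return subprocess.run([sys.executable, *args], env=legacy_stdio_env())


if __name__ == "__main__":
    run_python_in_legacy_mode(["-c", "print('legacy stdio requested')"])
```

The variable must be present before the interpreter starts, so it has to be set in the parent environment rather than from within the running program.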

Alternative Approaches

The win_unicode_console package is a pure-Python alternative to changing the default behaviour of the console. It implements essentially the same modifications as described here using pure Python code.

Code that may break

The following code patterns may break or see different behaviour as a result of this change. All of these code samples require explicitly choosing to use a raw file object in place of a more convenient wrapper that would prevent any visible change.

Assuming stdin/stdout encoding

Code that assumes that the encoding required by sys.stdin.buffer or sys.stdout.buffer is 'mbcs' or a more specific encoding may currently be working by chance, but could encounter issues under this change. For example:

    >>> sys.stdout.buffer.write(text.encode('mbcs'))
    >>> r = sys.stdin.buffer.read(16).decode('cp437')

To correct this code, the encoding specified on the TextIOWrapper should be used, either implicitly or explicitly:

    >>> # Fix 1: Use wrapper correctly
    >>> sys.stdout.write(text)
    >>> r = sys.stdin.read(16)

    >>> # Fix 2: Use encoding explicitly
    >>> sys.stdout.buffer.write(text.encode(sys.stdout.encoding))
    >>> r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)
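
The explicit fix can be wrapped in a helper that never hard-codes an encoding. This is a minimal sketch: the function name is hypothetical, and an in-memory TextIOWrapper stands in for sys.stdout so that the example runs anywhere.

```python
import io


def write_via_buffer(stream, text):
    """Write text to stream.buffer using the encoding the wrapper declares."""
    stream.flush()  # keep ordering with anything already written as text
    stream.buffer.write(text.encode(stream.encoding))
    stream.buffer.flush()


# Stand-in for sys.stdout: a text wrapper over an in-memory byte buffer.
backing = io.BytesIO()
fake_stdout = io.TextIOWrapper(backing, encoding="utf-8")

write_via_buffer(fake_stdout, "héllo")
assert backing.getvalue() == "héllo".encode("utf-8")
```

Reading the encoding from the wrapper means the same code works whether the stream is a console, a redirected file, or a process running in legacy mode.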

Incorrectly using the raw object

Code that uses the raw IO object and does not correctly handle partial reads and writes may be affected. This is particularly important for reads, where the number of characters read will never exceed one-fourth of the number of bytes allowed, as there is no feasible way to prevent input from encoding as much longer utf-8 strings:

    >>> raw_stdin = sys.stdin.buffer.raw
    >>> data = raw_stdin.read(15)
    abcdefghijklm
    b'abc'
    # data contains at most 3 characters, and never more than 12 bytes
    # error, as "defghijklm\r\n" is passed to the interactive prompt
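
The limits quoted in the example follow from reserving four bytes for every character. A short worked calculation, with hypothetical helper names:

```python
def max_chars_per_read(byte_count: int) -> int:
    """Characters requested from the console for a read of byte_count bytes."""
    return byte_count // 4


def max_bytes_returned(byte_count: int) -> int:
    """Upper bound on the utf-8 bytes such a read can return."""
    return 4 * max_chars_per_read(byte_count)


# A character outside the Basic Multilingual Plane needs four utf-8 bytes,
# which is the worst case the reservation has to allow for.
assert len("\U0001F600".encode("utf-8")) == 4

# read(15) asks for at most 3 characters and returns at most 12 bytes.
assert max_chars_per_read(15) == 3
assert max_bytes_returned(15) == 12
```

For ASCII input each character occupies a single byte, which is why the example returns only b'abc' even though 15 bytes were requested.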

To correct this code, the buffered reader/writer should be used, or the caller should continue reading until its buffer is full:

    >>> # Fix 1: Use the buffered reader/writer
    >>> stdin = sys.stdin.buffer
    >>> data = stdin.read(15)
    abcdefghijklm
    b'abcdefghijklm\r\n'

    >>> # Fix 2: Loop until enough bytes have been read
    >>> raw_stdin = sys.stdin.buffer.raw
    >>> b = b''
    >>> while len(b) < 15:
    ...     b += raw_stdin.read(15)
    abcdefghijklm
    b'abcdefghijklm\r\n'
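
The loop in the second fix can be made robust against end of input. The sketch below is one possible helper, not part of the PEP; read_at_least and ShortReadRaw are hypothetical names, and ShortReadRaw merely imitates a raw stream that returns fewer bytes than requested.

```python
import io


def read_at_least(raw, size):
    """Call raw.read(size) repeatedly until size bytes arrive or input ends."""
    chunks = []
    total = 0
    while total < size:
        chunk = raw.read(size)
        if not chunk:  # b'' signals end of input; None means no data available
            break
        chunks.append(chunk)
        total += len(chunk)
    return b"".join(chunks)


class ShortReadRaw(io.RawIOBase):
    """Stand-in raw stream that hands back at most 3 bytes per read."""

    def __init__(self, payload):
        self._source = io.BytesIO(payload)

    def readable(self):
        return True

    def readinto(self, buffer):
        data = self._source.read(min(len(buffer), 3))
        buffer[:len(data)] = data
        return len(data)


data = read_at_least(ShortReadRaw(b"abcdefghijklm\r\n"), 15)
assert data == b"abcdefghijklm\r\n"
```

As in the original loop, each call requests the full size, so the result may be longer than size if a later read returns more than the remaining shortfall.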

Using the raw object with small buffers

Code that uses the raw IO object and attempts to read fewer than four bytes will now receive an error. Because any single character may require up to four bytes when represented in utf-8, such requests must fail:

    >>> raw_stdin = sys.stdin.buffer.raw
    >>> data = raw_stdin.read(3)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: must read at least 4 bytes

The only workaround is to pass a larger buffer:

    >>> # Fix: Request at least four bytes
    >>> raw_stdin = sys.stdin.buffer.raw
    >>> data = raw_stdin.read(4)
    a
    b'a'
    >>>
    >>>

(The extra >>> is due to the newline remaining in the input buffer and is expected in this situation.)
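
Callers that cannot guarantee a large enough request could clamp the size before reading. This is a sketch only: MIN_RAW_READ and read_raw are hypothetical names, and io.BytesIO stands in for the raw console stream.

```python
import io

MIN_RAW_READ = 4  # smallest request the raw console stream is described as accepting


def read_raw(raw, size):
    """Read from a raw stream, never requesting fewer than MIN_RAW_READ bytes."""
    return raw.read(max(size, MIN_RAW_READ))


stand_in = io.BytesIO(b"abcdef")
assert read_raw(stand_in, 1) == b"abcd"
```

The trade-off is that the caller may receive more bytes than it asked for and has to keep the surplus for later use.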

Copyright

This document has been placed in the public domain.

Source: https://github.com/python/peps/blob/main/peps/pep-0528.rst

Last modified: 2025-02-01 08:59:27 GMT
