
Python Enhancement Proposals

PEP 540 – Add a new UTF-8 Mode

Author:
Victor Stinner <vstinner at python.org>
BDFL-Delegate:
INADA Naoki
Status:
Final
Type:
Standards Track
Created:
05-Jan-2016
Python-Version:
3.7
Resolution:
Python-Dev message

Abstract

Add a new “UTF-8 Mode” to enhance Python’s use of UTF-8. When UTF-8 Mode is active, Python will:

  • use the utf-8 encoding, regardless of the locale currently set by the current platform, and
  • change the stdin and stdout error handlers to surrogateescape.

This mode is off by default, but is automatically activated when using the “POSIX” locale.

Add the -X utf8 command line option and PYTHONUTF8 environment variable to control UTF-8 Mode.

Rationale

Locale encoding and UTF-8

Python 3.6 uses the locale encoding for filenames, environment variables, standard streams, etc. The locale encoding is inherited from the locale; the encoding and the locale are tightly coupled.

Many users inherit the ASCII encoding from the POSIX locale, aka the “C” locale, but are unable to change the locale for various reasons. This encoding is very limited in terms of Unicode support: any non-ASCII character is likely to cause trouble.

It isn’t always easy to get an accurate locale. Locales don’t get the exact same name on different Linux distributions, FreeBSD, macOS, etc. And some locales, like the recent C.UTF-8 locale, are only supported by a few platforms. The current locale can even vary on the same platform depending on context; for example, an SSH connection can use a different encoding than the filesystem or local terminal encoding on the same machine.

On the flip side, Python 3.6 is already using UTF-8 by default on macOS, Android and Windows (PEP 529) for most functions, although open() is a notable exception here. UTF-8 is also the default encoding of Python scripts, XML and JSON file formats. The Go programming language uses UTF-8 for all strings.

UTF-8 support is nearly ubiquitous for data read and written by modern platforms. It also has excellent support in Python. The problem is simply that the locale is frequently misconfigured. An obvious solution suggests itself: ignore the locale encoding and use UTF-8.

Passthrough for undecodable bytes: surrogateescape

When decoding bytes from UTF-8 using the default strict error handler, Python 3 raises a UnicodeDecodeError on the first undecodable byte.

Unix command line tools like cat or grep and most Python 2 applications simply do not have this class of bugs: they don’t decode data, but process data as a raw bytes sequence.

Python 3 already has a solution to behave like Unix tools and Python 2: the surrogateescape error handler (PEP 383). It allows processing data as if it were bytes, but uses Unicode in practice; undecodable bytes are stored as surrogate characters.
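
The difference between the two error handlers can be seen on a single byte string. The following is a minimal sketch; the sample bytes are an assumption chosen for illustration (a Latin-1 encoded "é" embedded in otherwise ASCII data).

```python
# Sample input (assumed for illustration): 0xE9 is Latin-1 "é",
# which is not a valid UTF-8 sequence in this position.
raw = b"caf\xe9 ok"

# Default strict handler: decoding fails on the first undecodable byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    bad_offset = exc.start  # index of the offending byte

# surrogateescape: the bad byte 0xE9 is stored as the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")

print(strict_failed, bad_offset, ascii(text), round_trip == raw)
```

Because the round trip is lossless, a program can read, filter and write such data without ever knowing its real encoding, which is the behaviour of cat or grep described above.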

UTF-8 Mode sets the surrogateescape error handler for stdin and stdout, since these streams are commonly associated with Unix command line tools.

However, users have a different expectation on files. Files are expected to be properly encoded, and Python is expected to fail early when open() is called with the wrong options, like opening a JPEG picture in text mode. The open() default error handler remains strict for these reasons.
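
The fail-early behaviour of open() can be sketched as follows. The file name and payload are hypothetical (only the first bytes resemble a JPEG header), and the encoding is passed explicitly so that the example does not depend on the current locale or on UTF-8 Mode.

```python
import os
import tempfile

# Hypothetical binary payload: JPEG files start with the bytes FF D8 FF.
payload = b"\xff\xd8\xff\xe0 not text"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "picture.jpg")
    with open(path, "wb") as f:
        f.write(payload)

    # strict is the default error handler: reading as text fails early.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        failed_early = False
    except UnicodeDecodeError:
        failed_early = True

    # For files, surrogateescape has to be requested explicitly.
    with open(path, encoding="utf-8", errors="surrogateescape") as f:
        text = f.read()

# The escaped text still maps back to the original bytes.
recovered = text.encode("utf-8", errors="surrogateescape")
print(failed_early, recovered == payload)
```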

No change by default for best backward compatibility

While UTF-8 is perfect in most cases, sometimes the locale encoding is actually the best encoding.

This PEP changes the behaviour for the POSIX locale since this locale is usually equivalent to the ASCII encoding, whereas UTF-8 is a much better choice. It does not change the behaviour for other locales, to prevent any risk of regression.

As users must explicitly enable the new UTF-8 Mode for these other locales, they are responsible for any potential mojibake issues caused by UTF-8 Mode.

Proposal

Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale encoding, and change stdin and stdout error handlers to surrogateescape.

Add the new -X utf8 command line option and PYTHONUTF8 environment variable. Users can explicitly activate UTF-8 Mode with the command-line option -X utf8 or by setting the environment variable PYTHONUTF8=1.

This mode is disabled by default and enabled by the POSIX locale. Users can explicitly disable UTF-8 Mode with the command-line option -X utf8=0 or by setting the environment variable PYTHONUTF8=0.
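
The four switches above can be checked from a script by starting child interpreters and reading sys.flags.utf8_mode (available since Python 3.7). This is only a sketch: the helper name utf8_mode_flag and its parameters are made up for this example.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: report whether UTF-8 Mode is active.
PROBE = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode_flag(x_options=(), env_overrides=None):
    """Hypothetical helper: run a child Python, return its utf8_mode flag."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    env.update(env_overrides or {})
    cmd = [sys.executable]
    for opt in x_options:
        cmd += ["-X", opt]
    cmd += ["-c", PROBE]
    result = subprocess.run(
        cmd, env=env, capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip())


on_by_option = utf8_mode_flag(x_options=["utf8"])
off_by_option = utf8_mode_flag(x_options=["utf8=0"])
on_by_env = utf8_mode_flag(env_overrides={"PYTHONUTF8": "1"})
off_by_env = utf8_mode_flag(env_overrides={"PYTHONUTF8": "0"})

print(on_by_option, off_by_option, on_by_env, off_by_env)
```

The value obtained with neither the option nor the variable set is deliberately not shown, since it depends on the locale of the machine running the script.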

For standard streams, the PYTHONIOENCODING environment variable has priority over UTF-8 Mode.
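
A small sketch of that priority rule: a child interpreter is started with UTF-8 Mode enabled and with PYTHONIOENCODING set to a different encoding and error handler. The value latin-1:strict is an arbitrary choice for the example. Encoding names are compared through codecs.lookup() because the exact spelling reported by sys.stdout.encoding can differ between Python versions.

```python
import codecs
import os
import subprocess
import sys

# Code run in the child: report what sys.stdout ended up using.
PROBE = "import sys; print(sys.stdout.encoding); print(sys.stdout.errors)"

env = dict(os.environ)
env["PYTHONIOENCODING"] = "latin-1:strict"  # arbitrary example value

result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", PROBE],
    env=env, capture_output=True, text=True, check=True,
)
encoding, errors = result.stdout.split()

# Compare codecs rather than raw names (aliases such as "latin-1" vary).
same_codec = codecs.lookup(encoding).name == codecs.lookup("latin-1").name
print(same_codec, errors)
```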

On Windows, the PYTHONLEGACYWINDOWSFSENCODING environment variable (PEP 529) has priority over UTF-8 Mode.

Effects of UTF-8 Mode:

  • sys.getfilesystemencoding() returns 'UTF-8'.
  • locale.getpreferredencoding() returns 'UTF-8'; its do_setlocale argument, and the locale encoding, are ignored.
  • sys.stdin and sys.stdout error handler is set to surrogateescape.

Side effects:

  • open() uses the UTF-8 encoding by default. However, it still uses the strict error handler by default.
  • os.fsdecode() and os.fsencode() use the UTF-8 encoding.
  • Command line arguments, environment variables and filenames use the UTF-8 encoding.
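
The effects listed above can be observed from a child interpreter started with -X utf8. This is a sketch rather than a test suite; the probe script and the sample byte string are assumptions made for the example. Encoding names are lower-cased before comparison, since the exact capitalisation ('UTF-8' or 'utf-8') has varied between Python releases.

```python
import json
import subprocess
import sys

# Script run in the child interpreter, which is started in UTF-8 Mode.
PROBE = r"""
import json, locale, os, sys
print(json.dumps({
    "fs": sys.getfilesystemencoding(),
    "preferred": locale.getpreferredencoding(),
    "name": os.fsdecode(b"caf\xc3\xa9"),  # UTF-8 bytes for "café"
}))
"""

result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", PROBE],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

fs_encoding = report["fs"].lower()
preferred_encoding = report["preferred"].lower()
decoded_name = report["name"]

print(fs_encoding, preferred_encoding, ascii(decoded_name))
```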

Relationship with the locale coercion (PEP 538)

The POSIX locale enables the locale coercion (PEP 538) and the UTF-8 Mode (PEP 540). When the locale coercion is enabled, enabling the UTF-8 Mode has no additional effect.

The UTF-8 Mode has the same effect as locale coercion:

  • sys.getfilesystemencoding() returns 'UTF-8',
  • locale.getpreferredencoding() returns 'UTF-8', and
  • the sys.stdin and sys.stdout error handlers are set to surrogateescape.

These changes only affect Python code. But the locale coercion has additional effects: the LC_CTYPE environment variable and the LC_CTYPE locale are set to a UTF-8 locale like C.UTF-8. One side effect is that non-Python code is also impacted by the locale coercion. The two PEPs are complementary.

On platforms like CentOS 7 where locale coercion is not supported, the POSIX locale only enables UTF-8 Mode. In this case, Python code uses the UTF-8 encoding and ignores the locale encoding, whereas non-Python code uses the locale encoding, which is usually ASCII for the POSIX locale.

While the UTF-8 Mode is supported on all platforms and can be enabled with any locale, the locale coercion is not supported by all platforms and is restricted to the POSIX locale.

The UTF-8 Mode affects Python child processes only when the PYTHONUTF8 environment variable is set to 1, whereas the locale coercion sets the LC_CTYPE environment variable, which affects all child processes.

The benefit of the locale coercion approach is that it helps ensure that encoding handling in binary extension modules and child processes is consistent with Python’s encoding handling. The upside of the UTF-8 Mode approach is that it allows an embedding application to change the interpreter’s behaviour without having to change the process global locale settings.

Backward Compatibility

The only backward incompatible change is that the POSIX locale now enables the UTF-8 Mode by default: it will now use the UTF-8 encoding, ignore the locale encoding, and change stdin and stdout error handlers to surrogateescape.

Annex: Encodings And Error Handlers

UTF-8 Mode changes the default encoding and error handler used by open(), os.fsdecode(), os.fsencode(), sys.stdin, sys.stdout and sys.stderr.

Encoding and error handler

Function                      Default                  UTF-8 Mode or POSIX locale
open()                        locale/strict            UTF-8/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   UTF-8/surrogateescape
sys.stdin, sys.stdout         locale/strict            UTF-8/surrogateescape
sys.stderr                    locale/backslashreplace  UTF-8/backslashreplace
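
The "UTF-8 Mode" column for open() and the standard streams can be reproduced with a short probe. This is a sketch under a few assumptions: the child is started with -X utf8, PYTHONIOENCODING is removed from its environment because that variable has priority for standard streams, stdin is redirected from the null device, and encoding names are lower-cased because their capitalisation has varied between Python releases.

```python
import json
import os
import subprocess
import sys

# Script run in the child: collect (encoding, errors) for each row.
PROBE = r"""
import json, os, sys, tempfile
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "probe.txt"), "w") as f:
        rows = {"open()": [f.encoding, f.errors]}
for name in ("stdin", "stdout", "stderr"):
    stream = getattr(sys, name)
    rows["sys." + name] = [stream.encoding, stream.errors]
print(json.dumps(rows))
"""

env = dict(os.environ)
env.pop("PYTHONIOENCODING", None)  # would otherwise override the streams

result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", PROBE],
    env=env, stdin=subprocess.DEVNULL,
    capture_output=True, text=True, check=True,
)

# Normalise encoding names to lower case before comparing or printing.
table = {
    name: (encoding.lower(), errors)
    for name, (encoding, errors) in json.loads(result.stdout).items()
}
for name, (encoding, errors) in table.items():
    print(f"{name:<12} {encoding}/{errors}")
```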

By comparison, Python 3.6 uses:

Function                      Default                  POSIX locale
open()                        locale/strict            locale/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape
sys.stdin, sys.stdout         locale/strict            locale/surrogateescape
sys.stderr                    locale/backslashreplace  locale/backslashreplace

Encoding and error handler on Windows

On Windows, the encodings and error handlers are different:

Function                      Default                 Legacy Windows FS encoding  UTF-8 Mode
open()                        mbcs/strict             mbcs/strict                 UTF-8/strict
os.fsdecode(), os.fsencode()  UTF-8/surrogatepass     mbcs/replace                UTF-8/surrogatepass
sys.stdin, sys.stdout         UTF-8/surrogateescape   UTF-8/surrogateescape       UTF-8/surrogateescape
sys.stderr                    UTF-8/backslashreplace  UTF-8/backslashreplace      UTF-8/backslashreplace

By comparison, Python 3.6 uses:

Function                      Default                 Legacy Windows FS encoding
open()                        mbcs/strict             mbcs/strict
os.fsdecode(), os.fsencode()  UTF-8/surrogatepass     mbcs/replace
sys.stdin, sys.stdout         UTF-8/surrogateescape   UTF-8/surrogateescape
sys.stderr                    UTF-8/backslashreplace  UTF-8/backslashreplace

The “Legacy Windows FS encoding” is enabled by the PYTHONLEGACYWINDOWSFSENCODING environment variable.

If stdin and/or stdout is redirected to a pipe, sys.stdin and/or sys.stdout uses the mbcs encoding by default rather than UTF-8. But in UTF-8 Mode, sys.stdin and sys.stdout always use the UTF-8 encoding.

Note

There is no POSIX locale on Windows. The ANSI code page is used as the locale encoding, and this code page never uses the ASCII encoding.

Links

Post History

Version History

  • Version 4: locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8 Mode.
  • Version 3: The UTF-8 Mode does not change the open() default error handler (strict) anymore, and the Strict UTF-8 Mode has been removed.
  • Version 2: Rewrite the PEP from scratch to make it much shorter and easier to understand.
  • Version 1: First version posted to python-dev.

Copyright

This document has been placed in the public domain.


Source: https://github.com/python/peps/blob/main/peps/pep-0540.rst

Last modified: 2025-02-01 08:59:27 GMT

