PEP 529 – Change Windows filesystem encoding to UTF-8

Author: Steve Dower <steve.dower at python.org>
Status:

Abstract

Historically, Python has used the ANSI APIs for interacting with the Windows operating system, often via C Runtime functions. However, these have long been discouraged in favor of the UTF-16 APIs. Within the operating system, all text is represented as UTF-16, and the ANSI APIs perform encoding and decoding using the active code page. See Naming Files, Paths, and Namespaces for more details.

This PEP proposes changing the default filesystem encoding on Windows to utf-8, and changing all filesystem functions to use the Unicode APIs for filesystem paths. This will not affect code that uses strings to represent paths; however, code that uses bytes for paths will now be able to correctly round-trip all valid paths in Windows filesystems. Currently, the conversions between Unicode (in the OS) and bytes (in Python) are lossy and fail to round-trip characters outside of the user’s active code page.

Notably, this does not impact the encoding of the contents of files. These will continue to default to locale.getpreferredencoding() (for text files) or plain bytes (for binary files). This only affects the encoding used when users pass a bytes object to Python where it is then passed to the operating system as a path name.
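The two defaults can be inspected side by side. The following is a minimal sketch, assuming Python 3.6 or later; the helper name report_encodings is hypothetical and is not part of this PEP.

```python
import locale
import sys


def report_encodings():
    """Return the two independent defaults: file contents and path names."""
    return {
        # Default for the contents of text files, as named in this PEP.
        "file contents": locale.getpreferredencoding(False),
        # Used only when a bytes path is converted for the operating system.
        "path names": sys.getfilesystemencoding(),
    }


for purpose, encoding in report_encodings().items():
    print(f"{purpose}: {encoding}")
```

The two values may differ, and changing one does not change the other.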

Background

File system paths are almost universally represented as text with an encoding determined by the file system. In Python, we expose these paths via a number of interfaces, such as the os and io modules. Paths may be passed either direction across these interfaces, that is, from the filesystem to the application (for example, os.listdir()), or from the application to the filesystem (for example, os.unlink()).

When paths are passed between the filesystem and the application, they are either passed through as a bytes blob or converted to/from str using os.fsencode() and os.fsdecode() or explicit encoding using sys.getfilesystemencoding(). The result of encoding a string with sys.getfilesystemencoding() is a blob of bytes in the native format for the default file system.
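As an illustration of these conversions, the sketch below wraps os.fsencode() and os.fsdecode(). The wrapper names encode_path and decode_path are hypothetical and exist only to label the two directions.

```python
import os
import sys


def encode_path(path: str) -> bytes:
    """Convert a text path to bytes using the filesystem encoding."""
    return os.fsencode(path)


def decode_path(raw: bytes) -> str:
    """Convert a bytes path back to text using the same encoding."""
    return os.fsdecode(raw)


name = "notes.txt"
raw = encode_path(name)
print(sys.getfilesystemencoding(), raw)
# An ASCII-only name survives the round trip under any ASCII-compatible
# filesystem encoding.
print(decode_path(raw) == name)
```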

On Windows, the native format for the filesystem is utf-16-le. The recommended platform APIs for accessing the filesystem all accept and return text encoded in this format. However, prior to Windows NT (and possibly further back), the native format was a configurable machine option and a separate set of APIs existed to accept this format. The option (the “active code page”) and these APIs (the “*A functions”) still exist in recent versions of Windows for backwards compatibility, though new functionality often only has a utf-16-le API (the “*W functions”).

In Python, str is recommended because it can correctly round-trip all characters used in paths (on POSIX with surrogateescape handling; on Windows because str maps to the native representation). On Windows, bytes cannot round-trip all characters used in paths, as Python internally uses the *A functions and hence the encoding is “whatever the active code page is”. Since the active code page cannot represent all Unicode characters, the conversion of a path into bytes can lose information without warning or any available indication.

As a demonstration of this:

>>> open('test\uAB00.txt', 'wb').close()
>>> import glob
>>> glob.glob('test*')
['test\uab00.txt']
>>> glob.glob(b'test*')
[b'test?.txt']

The Unicode character in the second call to glob has been replaced by a ‘?’, which means passing the path back into the filesystem will result in a FileNotFoundError. The same results may be observed with os.listdir() or any function that matches the return type to the parameter type.
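The loss can be reproduced on any platform without touching the filesystem. The sketch below is an approximation only: it assumes cp1252 as a stand-in for the active code page and uses Python's own codec with the replace error mode, rather than the conversion Windows performs inside the *A functions. The helper names are hypothetical.

```python
# cp1252 is an assumption standing in for the user's active code page.
ACTIVE_CODE_PAGE = "cp1252"


def to_ansi_bytes(name: str) -> bytes:
    """Mimic the lossy conversion: unencodable characters become b'?'."""
    return name.encode(ACTIVE_CODE_PAGE, errors="replace")


def survives_round_trip(name: str) -> bool:
    """Report whether the bytes form still identifies the original name."""
    return to_ansi_bytes(name).decode(ACTIVE_CODE_PAGE) == name


print(to_ansi_bytes("test\uAB00.txt"))        # b'test?.txt'
print(survives_round_trip("test\uAB00.txt"))  # False
print(survives_round_trip("test.txt"))        # True
```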

While one user-accessible fix is to use str everywhere, POSIX systems generally do not suffer from data loss when using bytes exclusively as the bytes are the canonical representation. Even if the encoding is “incorrect” by some standard, the file system will still map the bytes back to the file. Making use of this avoids the cost of decoding and reencoding, such that (theoretically, and only on POSIX), code such as this may be faster because of the use of b'.' compared to using '.':

>>> import os
>>> for f in os.listdir(b'.'):
...     os.stat(f)
...

As a result, POSIX-focused library authors prefer to use bytes to represent paths. For some authors it is also a convenience, as their code may receive bytes already known to be encoded correctly, while others are attempting to simplify porting their code from Python 2. However, the correctness assumptions do not carry over to Windows where Unicode is the canonical representation, and errors may result. This potential data loss is why the use of bytes paths on Windows was deprecated in Python 3.3 - all of the above code snippets produce deprecation warnings on Windows.

Proposal

Currently the default filesystem encoding is ‘mbcs’, which is a meta-encoder that uses the active code page. However, when bytes are passed to the filesystem they go through the *A APIs and the operating system handles encoding. In this case, paths are always encoded using the equivalent of ‘mbcs:replace’ with no opportunity for Python to override or change this.

This proposal would remove all use of the *A APIs and only ever call the *W APIs. When Windows returns paths to Python as str, they will be decoded from utf-16-le and returned as text (in whatever the minimal representation is). When Python code requests paths as bytes, the paths will be transcoded from utf-16-le into utf-8 using surrogatepass (Windows does not validate surrogate pairs, so it is possible to have invalid surrogates in filenames). Equally, when paths are provided as bytes, they are transcoded from utf-8 into utf-16-le and passed to the *W APIs.
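The transcoding step can be sketched in pure Python. This is an illustration of the idea only, not the C implementation; the function names wide_to_bytes_path and bytes_path_to_wide are hypothetical.

```python
def wide_to_bytes_path(wide: bytes) -> bytes:
    """utf-16-le data from the OS -> utf-8 bytes handed to Python code."""
    text = wide.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def bytes_path_to_wide(raw: bytes) -> bytes:
    """utf-8 bytes from Python code -> utf-16-le data for the *W APIs."""
    text = raw.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")


# A name containing a lone (unpaired) surrogate, which Windows permits.
wide_name = "bad\ud800name.txt".encode("utf-16-le", "surrogatepass")
raw_name = wide_to_bytes_path(wide_name)

# The name survives the trip through utf-8 bytes unchanged.
print(bytes_path_to_wide(raw_name) == wide_name)
```

With the default strict error mode, the same name cannot be encoded to utf-8 at all, which is why surrogatepass is needed here.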

The use of utf-8 will not be configurable, except for the provision of a “legacy mode” flag to revert to the previous behaviour.

The surrogateescape error mode does not apply here, as the concern is not about retaining nonsensical bytes. Any path returned from the operating system will be valid Unicode, while invalid paths created by the user should raise a decoding error (currently these would raise OSError or a subclass).

The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure the ability to round-trip path names and allow basic manipulation (for example, using the os.path module) when assuming an ASCII-compatible encoding. Using utf-16-le as the encoding is more pure, but will cause more issues than are resolved.
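A small sketch of why ASCII compatibility matters for path manipulation. It uses posixpath so that the result is the same on every platform; this is an assumption made for the demonstration, and on Windows os.path would be ntpath, which likewise treats separators as single bytes when given bytes paths.

```python
import posixpath

parent, child = "docs", "r\u00e9sum\u00e9.txt"

# utf-8 keeps the separator as the single byte b'/', so joining bytes works.
joined_utf8 = posixpath.join(parent.encode("utf-8"), child.encode("utf-8"))
print(joined_utf8.decode("utf-8") == posixpath.join(parent, child))

# utf-16-le stores each of these characters as two bytes, so inserting a
# one-byte separator leaves the result misaligned and no longer valid.
joined_utf16 = posixpath.join(
    parent.encode("utf-16-le"), child.encode("utf-16-le")
)
print(joined_utf16 == posixpath.join(parent, child).encode("utf-16-le"))
```

The first comparison holds and the second does not, since the joined utf-16-le data has an odd number of bytes.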

This change would also undeprecate the use of bytes paths on Windows. No change to the semantics of using bytes as a path is required - as before, they must be encoded with the encoding specified by sys.getfilesystemencoding().

Specific Changes

Update sys.getfilesystemencoding

Remove the default value for Py_FileSystemDefaultEncoding and set it in initfsencoding() to utf-8, or, if the legacy-mode switch is enabled, to mbcs.

Update the implementations of PyUnicode_DecodeFSDefaultAndSize() and PyUnicode_EncodeFSDefault() to use the utf-8 codec, or, if the legacy-mode switch is enabled, the existing mbcs codec.

Add sys.getfilesystemencodeerrors

As the error mode may now change between surrogatepass and replace, Python code that manually performs encoding also needs access to the current error mode. This includes the implementation of os.fsencode() and os.fsdecode(), which currently assume an error mode based on the codec.

Add a public Py_FileSystemDefaultEncodeErrors, similar to the existing Py_FileSystemDefaultEncoding. The default value on Windows will be surrogatepass or, in legacy mode, replace. The default value on all other platforms will be surrogateescape.

Add a public sys.getfilesystemencodeerrors() function that returns the current error mode.

Update the implementations of PyUnicode_DecodeFSDefaultAndSize() and PyUnicode_EncodeFSDefault() to use the variable for error mode rather than constant strings.

Update the implementations of os.fsencode() and os.fsdecode() to use sys.getfilesystemencodeerrors() instead of assuming the mode.
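For code that encodes paths by hand, pairing the two functions might look like the sketch below. It assumes Python 3.6 or later, where sys.getfilesystemencodeerrors() is available; the helper names manual_fsencode and manual_fsdecode are hypothetical.

```python
import os
import sys


def manual_fsencode(path: str) -> bytes:
    """Encode a path by hand with the interpreter's codec and error mode."""
    return path.encode(
        sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()
    )


def manual_fsdecode(raw: bytes) -> str:
    """Decode a bytes path by hand with the same codec and error mode."""
    return raw.decode(
        sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()
    )


print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
print(manual_fsencode("data.bin") == os.fsencode("data.bin"))
```

Querying the error mode instead of hard-coding it keeps such code correct whether or not legacy mode is enabled.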

Update path_converter

Update the path converter to always decode bytes or buffer objects into text using PyUnicode_DecodeFSDefaultAndSize().

Change the narrow field from a char* string into a flag that indicates whether the original object was bytes. This is required for functions that need to return paths using the same type as was originally provided.
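The intent of the flag can be pictured in Python, although the real change is made in C inside the path converter. In this hypothetical sketch, list_names is an invented helper and was_bytes plays the role of the flag.

```python
import os
import tempfile


def list_names(path):
    """List a directory, returning names in the same type as ``path``."""
    was_bytes = isinstance(path, bytes)  # plays the role of the new flag
    text_path = os.fsdecode(path)        # always work with text internally
    names = os.listdir(text_path)
    return [os.fsencode(n) for n in names] if was_bytes else names


with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "example.txt"), "w").close()
    print(list_names(folder))               # names as str
    print(list_names(os.fsencode(folder)))  # names as bytes
```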

Remove unused ANSI code

Remove all code paths using the narrow field, as these will no longer be reachable by any caller. These are only used within posixmodule.c. Other uses of paths should have use of bytes paths replaced with decoding and use of the *W APIs.

Add legacy mode

Add a legacy mode flag, enabled by the environment variable PYTHONLEGACYWINDOWSFSENCODING or by a function call to sys._enablelegacywindowsfsencoding(). The function call can only be used to enable the flag and should be used by programs as close to initialization as possible. Legacy mode cannot be disabled while Python is running.

When this flag is set, the default filesystem encoding is set to mbcs rather than utf-8, and the error mode is set to replace rather than surrogatepass. Paths will continue to decode to wide characters and only *W APIs will be called; however, the bytes passed in and received from Python will be encoded the same as prior to this change.
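A program that needs the old behaviour might guard the call as sketched below. The helper names are hypothetical, and the guard assumes that sys._enablelegacywindowsfsencoding() is present only on Windows builds that provide it, so the sketch does nothing elsewhere.

```python
import os
import sys


def legacy_mode_requested() -> bool:
    """True if the environment variable named in this PEP is non-empty."""
    return bool(os.environ.get("PYTHONLEGACYWINDOWSFSENCODING"))


def enable_legacy_mode() -> bool:
    """Switch to legacy mode where the interpreter provides the function."""
    enable = getattr(sys, "_enablelegacywindowsfsencoding", None)
    if enable is None:  # not a Windows build, or the function is unavailable
        return False
    enable()            # one-way switch: it cannot be undone while running
    return True


def current_settings():
    """Return the (encoding, error mode) pair currently in effect."""
    return sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()


print(legacy_mode_requested(), current_settings())
```

Calling enable_legacy_mode() as early as possible, before any paths have been encoded, follows the recommendation above.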

Undeprecate bytes paths on Windows

Using bytes as paths on Windows is currently deprecated. We would announce that this is no longer the case, and that paths when encoded as bytes should use whatever is returned from sys.getfilesystemencoding() rather than the user’s active code page.

Beta experiment

To assist with determining the impact of this change, we propose applying it to 3.6.0b1 provisionally with the intent being to make a final decision before 3.6.0b4.

During the experiment period, decoding and encoding exception messages will be expanded to include a link to an active online discussion and encourage reporting of problems.

If it is decided to revert the functionality for 3.6.0b4, the implementation change would be to permanently enable the legacy mode flag, change the environment variable to PYTHONWINDOWSUTF8FSENCODING and function to sys._enablewindowsutf8fsencoding() to allow enabling the functionality on a case-by-case basis, as opposed to disabling it.

It is expected that if we cannot feasibly make the change for 3.6 due to compatibility concerns, it will not be possible to make the change at any later time in Python 3.x.

Affected Modules

This PEP implicitly includes all modules within Python that either pass path names to the operating system, or otherwise use sys.getfilesystemencoding().

As of 3.6.0a4, the following modules require modification:

os
_overlapped
_socket
subprocess
zipimport

The following modules use sys.getfilesystemencoding() but do not need modification:

gc (already assumes bytes are utf-8)
grp (not compiled for Windows)
http.server (correctly includes codec name with transmitted data)
idlelib.editor (should not be needed; has fallback handling)
nis (not compiled for Windows)
pwd (not compiled for Windows)
spwd (not compiled for Windows)
_ssl (only used for ASCII constants)
tarfile (code unused on Windows)
_tkinter (already assumes bytes are utf-8)
wsgiref (assumed as the default encoding for unknown environments)
zipapp (code unused on Windows)

The following native code uses one of the encoding or decoding functions, but does not require any modification:

Parser/parsetok.c (docs already specify sys.getfilesystemencoding())
Python/ast.c (docs already specify sys.getfilesystemencoding())
Python/compile.c (undocumented, but Python filesystem encoding implied)
Python/errors.c (docs already specify os.fsdecode())
Python/fileutils.c (code unused on Windows)
Python/future.c (undocumented, but Python filesystem encoding implied)
Python/import.c (docs already specify utf-8)
Python/importdl.c (code unused on Windows)
Python/pythonrun.c (docs already specify sys.getfilesystemencoding())
Python/symtable.c (undocumented, but Python filesystem encoding implied)
Python/thread.c (code unused on Windows)
Python/traceback.c (encodes correctly for comparing strings)
Python/_warnings.c (docs already specify os.fsdecode())

Rejected Alternatives

Use strict mbcs decoding

This is essentially the same as the proposed change, but instead of changing sys.getfilesystemencoding() to utf-8 it is changed to mbcs (which dynamically maps to the active code page).

This approach allows the use of new functionality that is only available as *W APIs and also detection of encoding/decoding errors. For example, rather than silently replacing Unicode characters with ‘?’, it would be possible to warn or fail the operation.
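What detection could look like under this rejected alternative is sketched below. It assumes cp1252 as a stand-in for the active code page and uses Python's strict error mode; the helper name encode_or_warn is hypothetical.

```python
import warnings

# cp1252 is an assumption standing in for the user's active code page.
ACTIVE_CODE_PAGE = "cp1252"


def encode_or_warn(name: str):
    """Return the encoded name, or None with a warning if it is lossy."""
    try:
        return name.encode(ACTIVE_CODE_PAGE, errors="strict")
    except UnicodeEncodeError as exc:
        warnings.warn(f"cannot represent {name!r} as bytes: {exc.reason}")
        return None


print(encode_or_warn("test.txt"))
print(encode_or_warn("test\uAB00.txt"))
```

The failure becomes visible, but the path with the unrepresentable character still cannot be expressed as bytes.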

Compared to the proposed fix, this could enable some new functionality but does not fix any of the problems described initially. New runtime errors may cause some problems to be more obvious and lead to fixes, provided library maintainers are interested in supporting Windows and adding a separate code path to treat filesystem paths as strings.

Making the encoding mbcs without strict errors is equivalent to the legacy-mode switch being enabled by default. This is a possible course of action if there is significant breakage of actual code and a need to extend the deprecation period, but still a desire to have the simplifications to the CPython source.

Make bytes paths an error on Windows

By preventing the use of bytes paths on Windows completely we prevent users from hitting encoding issues.

However, the motivation for this PEP is to increase the likelihood that code written on POSIX will also work correctly on Windows. This alternative would move the other direction and make such code completely incompatible. As this does not benefit users in any way, we reject it.

Make bytes paths an error on all platforms

By deprecating and then disabling the use of bytes paths on all platforms, we prevent users from hitting encoding issues regardless of where the code was originally written. This would require a full deprecation cycle, as there are currently no warnings on platforms other than Windows.

This is likely to be seen as a hostile action against Python developers in general, and as such is rejected at this time.

Code that may break

The following code patterns may break or see different behaviour as a result of this change. Each of these examples would have been fragile in code intended for cross-platform use. The suggested fixes demonstrate the most compatible way to handle path encoding issues across all platforms and across multiple Python versions.

Note that all of these examples produce deprecation warnings on Python 3.3 and later.

Not managing encodings across boundaries

Code that does not manage encodings when crossing protocol boundaries may currently be working by chance, but could encounter issues when either encoding changes. Note that the source of filename may be any function that returns a bytes object, as illustrated in a second example below:

>>> filename = open('filename_in_mbcs.txt', 'rb').read()
>>> text = open(filename, 'r').read()

To correct this code, the encoding of the bytes in filename should be specified, either when reading from the file or before using the value:

>>> # Fix 1: Open file as text (default encoding)
>>> filename = open('filename_in_mbcs.txt', 'r').read()
>>> text = open(filename, 'r').read()

>>> # Fix 2: Open file as text (explicit encoding)
>>> filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read()
>>> text = open(filename, 'r').read()

>>> # Fix 3: Explicitly decode the path
>>> filename = open('filename_in_mbcs.txt', 'rb').read()
>>> text = open(filename.decode('mbcs'), 'r').read()

Where the creator of filename is separated from the user of filename, the encoding is important information to include:

>>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('mbcs')

>>> filename = some_object.filename
>>> type(filename)
<class 'bytes'>
>>> text = open(filename, 'r').read()

To fix this code for best compatibility across operating systems and Python versions, the filename should be exposed as str:

>>> # Fix 1: Expose as str
>>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'

>>> filename = some_object.filename
>>> type(filename)
<class 'str'>
>>> text = open(filename, 'r').read()

Alternatively, the encoding used for the path needs to be made available to the user. Specifying os.fsencode() (or sys.getfilesystemencoding()) is an acceptable choice, or a new attribute could be added with the exact encoding:

>>> # Fix 2: Use fsencode
>>> import os
>>> some_object.filename = os.fsencode(r'C:\Users\Steve\Documents\my_file.txt')

>>> filename = some_object.filename
>>> type(filename)
<class 'bytes'>
>>> text = open(filename, 'r').read()

>>> # Fix 3: Expose as explicit encoding
>>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('cp437')
>>> some_object.filename_encoding = 'cp437'

>>> filename = some_object.filename
>>> type(filename)
<class 'bytes'>
>>> filename = filename.decode(some_object.filename_encoding)
>>> type(filename)
<class 'str'>
>>> text = open(filename, 'r').read()

Explicitly using ‘mbcs’

Code that explicitly encodes text using ‘mbcs’ before passing to file system APIs is now passing incorrectly encoded bytes. Note that the source of filename in this example is not relevant, provided that it is a str:

>>> filename = open('files.txt', 'r').readline().rstrip()
>>> text = open(filename.encode('mbcs'), 'r')

To correct this code, the string should be passed without explicit encoding, or should use os.fsencode():

>>> # Fix 1: Do not encode the string
>>> filename = open('files.txt', 'r').readline().rstrip()
>>> text = open(filename, 'r')

>>> # Fix 2: Use correct encoding
>>> import os
>>> filename = open('files.txt', 'r').readline().rstrip()
>>> text = open(os.fsencode(filename), 'r')

Copyright

This document has been placed in the public domain.

Source: https://github.com/python/peps/blob/main/peps/pep-0529.rst

Last modified: 2025-02-01 08:59:27 GMT