PEP 263 – Defining Python Source Code Encodings

Author: Marc-André Lemburg <mal at lemburg.com>, Martin von Löwis <martin at v.loewis.de>
Status:

Abstract

This PEP proposes to introduce a syntax to declare the encoding of a Python source file. The encoding information is then used by the Python parser to interpret the file using the given encoding. Most notably, this enhances the interpretation of Unicode literals in the source code and makes it possible to write Unicode literals using e.g. UTF-8 directly in a Unicode-aware editor.

Problem

In Python 2.1, Unicode literals can only be written using the Latin-1 based encoding “unicode-escape”. This makes the programming environment rather unfriendly to Python users who live and work in non-Latin-1 locales, such as many of the Asian countries. Programmers can write their 8-bit strings using their favorite encoding, but are bound to the “unicode-escape” encoding for Unicode literals.

Proposed Solution

I propose to make the Python source code encoding both visible and changeable on a per-source-file basis by using a special comment at the top of the file to declare the encoding.

To make Python aware of this encoding declaration, a number of concept changes are necessary with respect to the handling of Python source code data.

Defining the Encoding

Python will default to ASCII as the standard encoding if no other encoding hints are given.

To define a source code encoding, a magic comment must be placed into the source file either as the first or second line in the file, such as:

# coding=<encoding name>

or (using formats recognized by popular editors):

#!/usr/bin/python
# -*- coding: <encoding name> -*-

or:

#!/usr/bin/python
# vim: set fileencoding=<encoding name> :

More precisely, the first or second line must match the following regular expression:

^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)

The first group of this expression is then interpreted as the encoding name. If the encoding is unknown to Python, an error is raised during compilation. There must not be any Python statement on the line that contains the encoding declaration. If the first line matches, the second line is ignored.
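As an illustration, the matching rule can be sketched with Python's `re` module. The helper name `find_declared_encoding` is hypothetical and not part of this PEP or of any library. The sketch only extracts the name from the first two lines; it does not check whether the encoding is known to Python.

```python
import re

# The regular expression given above, applied to one line at a time.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def find_declared_encoding(source_text):
    """Return the declared encoding name, or None if there is none.

    Hypothetical helper: only the first two lines are examined, and a
    match on the first line means the second line is never looked at.
    """
    for line in source_text.split("\n")[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


print(find_declared_encoding("#!/usr/bin/python\n# -*- coding: latin-1 -*-\n"))
print(find_declared_encoding("# vim: set fileencoding=utf-8 :\n"))
print(find_declared_encoding("import os, sys\n"))
```

Because the expression is anchored to optional whitespace followed by `#`, a line that starts with a Python statement cannot match, which is how the "no statement on the declaration line" rule falls out of the pattern.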

To aid with platforms such as Windows, which add Unicode BOM marks to the beginning of Unicode files, the UTF-8 signature \xef\xbb\xbf will be interpreted as ‘utf-8’ encoding as well (even if no magic encoding comment is given).

If a source file uses both the UTF-8 BOM mark signature and a magic encoding comment, the only allowed encoding for the comment is ‘utf-8’. Any other encoding will cause an error.
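The signature rules can be sketched as follows. This is a minimal illustration, not the interpreter's implementation: the helper `resolve_bom` is hypothetical, and the choice of `SyntaxError` for the conflict case is an assumption, since the text above only says that an error occurs.

```python
import codecs


def resolve_bom(raw, declared=None):
    """Return (encoding, payload) after applying the UTF-8 signature rules.

    Hypothetical helper: `declared` is the name taken from a magic
    comment, or None when the file has no such comment.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # Compare via the codec registry so that spellings such as
        # "UTF8" and "utf-8" are treated as the same encoding.
        if declared is not None and codecs.lookup(declared).name != "utf-8":
            # Assumed exception type; the PEP only requires "an error".
            raise SyntaxError("UTF-8 signature conflicts with %r" % declared)
        return "utf-8", raw[len(codecs.BOM_UTF8):]
    # No signature: use the declaration, else the ASCII default.
    return declared or "ascii", raw


print(resolve_bom(codecs.BOM_UTF8 + b"x = 1\n"))
print(resolve_bom(b"x = 1\n", "latin-1"))
```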

Examples

These are some examples to clarify the different styles for defining the source code encoding at the top of a Python source file:

With interpreter binary and using Emacs style file encoding comment:

#!/usr/bin/python
# -*- coding: latin-1 -*-
import os, sys
...

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import os, sys
...

#!/usr/bin/python
# -*- coding: ascii -*-
import os, sys
...

Without interpreter line, using plain text:

# This Python file uses the following encoding: utf-8
import os, sys
...

Text editors might have different ways of defining the file’s encoding, e.g.:
```
#!/usr/local/bin/python
# coding: latin-1
import os, sys
...
```
Without encoding comment, Python’s parser will assume ASCII text:
```
#!/usr/local/bin/python
import os, sys
...
```

Encoding comments which don’t work:

Missing “coding:” prefix:

#!/usr/local/bin/python
# latin-1
import os, sys
...

Encoding comment not on line 1 or 2:

#!/usr/local/bin/python
#
# -*- coding: latin-1 -*-
import os, sys
...

Unsupported encoding:

#!/usr/local/bin/python
# -*- coding: utf-42 -*-
import os, sys
...
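Examples like these can be checked on a current Python 3 interpreter with the standard library's `tokenize.detect_encoding()`. Two differences from the text of this PEP should be kept in mind: the function reports normalized names (for instance `iso-8859-1` for `latin-1`), and when no declaration is found it reports UTF-8 rather than the ASCII default described here. The wrapper `detect` is a hypothetical convenience helper.

```python
import io
import tokenize


def detect(source_bytes):
    """Return the encoding that tokenize detects for a bytes source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


working = b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\nimport os, sys\n"
plain = b"# This Python file uses the following encoding: utf-8\nimport os, sys\n"
too_late = b"#!/usr/local/bin/python\n#\n# -*- coding: latin-1 -*-\nimport os, sys\n"
unsupported = b"#!/usr/local/bin/python\n# -*- coding: utf-42 -*-\nimport os, sys\n"

print(detect(working))    # normalized spelling of latin-1
print(detect(plain))
print(detect(too_late))   # declaration on line 3 is ignored, default reported

try:
    detect(unsupported)
except SyntaxError as exc:
    print("rejected:", exc)
```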

Concepts

The PEP is based on the following concepts which would have to be implemented to enable usage of such a magic comment:

1. The complete Python source file should use a single encoding. Embedding of differently encoded data is not allowed and will result in a decoding error during compilation of the Python source code.
2. Any encoding which allows processing the first two lines in the way indicated above is allowed as source code encoding; this includes ASCII compatible encodings as well as certain multi-byte encodings such as Shift_JIS. It does not include encodings which use two or more bytes for all characters, such as UTF-16. The reason for this is to keep the encoding detection algorithm in the tokenizer simple.
3. Handling of escape sequences should continue to work as it does now, but with all possible source code encodings; that is, standard string literals (both 8-bit and Unicode) are subject to escape sequence expansion while raw string literals only expand a very small subset of escape sequences.
4. Python’s tokenizer/compiler combo will need to be updated to work as follows:
   1. read the file
   2. decode it into Unicode assuming a fixed per-file encoding
   3. convert it into a UTF-8 byte string
   4. tokenize the UTF-8 content
   5. compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding

Note that Python identifiers are restricted to the ASCII subset of the encoding, and thus need no further conversion after step 4.
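The byte-level conversions in the processing steps above can be sketched as follows. This is only an illustration of steps 2, 3 and the re-encoding part of step 5; tokenizing and compiling are left out, and the helper names `to_utf8` and `literal_to_8bit` are hypothetical.

```python
def to_utf8(raw, file_encoding):
    """Steps 2 and 3: decode with the per-file encoding, re-encode as UTF-8."""
    text = raw.decode(file_encoding)   # step 2: bytes -> Unicode
    return text.encode("utf-8")        # step 3: Unicode -> UTF-8 byte string


def literal_to_8bit(utf8_literal, file_encoding):
    """Step 5 for 8-bit strings: UTF-8 data back into the file encoding."""
    return utf8_literal.decode("utf-8").encode(file_encoding)


# Stand-in for step 1: bytes as they would be read from a Latin-1 file.
raw = b"s = '\xe9'\n"

utf8_source = to_utf8(raw, "latin-1")
print(utf8_source)                                # the literal is now two bytes

# The tokenizer would hand the literal's UTF-8 payload to the compiler,
# which restores the original single byte for an 8-bit string object.
print(literal_to_8bit(b"\xc3\xa9", "latin-1"))
```

The identifier `s` consists of ASCII bytes only, so it is identical before and after the conversion, which is why identifiers need no further handling.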

Implementation

For backwards-compatibility with existing code which currently uses non-ASCII in string literals without declaring an encoding, the implementation will be introduced in two phases:

1. Allow non-ASCII in string literals and comments, by internally treating a missing encoding declaration as a declaration of “iso-8859-1”. This will cause arbitrary byte strings to correctly round-trip between step 2 and step 5 of the processing, and provide compatibility with Python 2.2 for Unicode literals that contain non-ASCII bytes.
   A warning will be issued if non-ASCII bytes are found in the input, once per improperly encoded input file.
2. Remove the warning, and change the default encoding to “ascii”.
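The round-trip property that makes “iso-8859-1” a safe interim default can be checked directly: every byte value maps to the code point with the same number, so decoding and re-encoding is lossless. The snippet below is a small demonstration on Python 3, not part of the proposed implementation.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as iso-8859-1 maps byte N to code point N ...
as_unicode = all_bytes.decode("iso-8859-1")
assert [ord(ch) for ch in as_unicode] == list(range(256))

# ... so encoding again restores the exact original bytes.
assert as_unicode.encode("iso-8859-1") == all_bytes

# With an "ascii" default (phase 2), a non-ASCII byte fails to decode.
try:
    b"\xe9".decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii rejects the byte:", exc.reason)
```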

The builtin compile() API will be enhanced to accept Unicode as input. 8-bit string input is subject to the standard procedure for encoding detection as described above.

If a Unicode string with a coding declaration is passed to compile(), a SyntaxError will be raised.
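As a brief illustration of encoding detection on 8-bit input, the following sketch passes a byte string with a magic comment to compile(). It assumes a current Python 3 interpreter, where `bytes` objects play the role of 8-bit strings; the `SyntaxError` case for Unicode input is not exercised here, since its behavior depends on the Python version.

```python
# Source as raw bytes: the literal contains the single byte 0xE9,
# which is U+00E9 in Latin-1.
source = b"# -*- coding: latin-1 -*-\nvalue = '\xe9'\n"

namespace = {}
code = compile(source, "<example>", "exec")   # encoding taken from line 1
exec(code, namespace)

# The byte was decoded using the declared encoding.
print(ascii(namespace["value"]))
print(namespace["value"] == "\u00e9")
```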

SUZUKI Hisao is working on a patch; see [2] for details. A patch implementing only phase 1 is available at [1].

[2] Phase 2 implementation: https://bugs.python.org/issue534304

History

1.10 and above: see CVS history
1.8: Added ‘.’ to the coding RE.
1.7: Added warnings to phase 1 implementation. Replaced the Latin-1 default encoding with the interpreter’s default encoding. Added tweaks to compile().
1.4 - 1.6: Minor tweaks
1.3: Worked in comments by Martin v. Loewis: UTF-8 BOM mark detection, Emacs style magic comment, two phase approach to the implementation

Copyright

This document has been placed in the public domain.

Source: https://github.com/python/peps/blob/main/peps/pep-0263.rst

Last modified: 2025-02-01 08:55:40 GMT
