This PEP proposes to introduce a syntax to declare the encoding of a Python source file. The encoding information is then used by the Python parser to interpret the file using the given encoding. Most notably, this enhances the interpretation of Unicode literals in the source code and makes it possible to write Unicode literals using e.g. UTF-8 directly in a Unicode-aware editor.
In Python 2.1, Unicode literals can only be written using the Latin-1 based encoding “unicode-escape”. This makes the programming environment rather unfriendly to Python users who live and work in non-Latin-1 locales such as many of the Asian countries. Programmers can write their 8-bit strings using their favorite encoding, but are bound to the “unicode-escape” encoding for Unicode literals.
I propose to make the Python source code encoding both visible and changeable on a per-source-file basis by using a special comment at the top of the file to declare the encoding.
To make Python aware of this encoding declaration, a number of conceptual changes are necessary with respect to the handling of Python source code data.
Python will default to ASCII as the standard encoding if no other encoding hints are given.
To define a source code encoding, a magic comment must be placed into the source file either as the first or second line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors):
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or:
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
More precisely, the first or second line must match the following regular expression:
^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)
The first group of this expression is then interpreted as the encoding name. If the encoding is unknown to Python, an error is raised during compilation. There must not be any Python statement on the line that contains the encoding declaration. If the first line matches, the second line is ignored.
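The matching rule can be sketched in a few lines of Python. This is only an illustration, not the tokenizer's actual implementation: the helper name `declared_encoding` is hypothetical, and using `codecs.lookup` to decide whether an encoding is "known to Python" is an assumption made for the sketch.

```python
import codecs
import re

# The regular expression from the text, applied to the raw bytes of a line.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source: bytes):
    """Return the encoding named on line 1 or 2 of `source`, or None.

    Hypothetical helper: raises SyntaxError for a declared but unknown name.
    """
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            name = match.group(1).decode("ascii")
            try:
                codecs.lookup(name)  # unknown names raise LookupError
            except LookupError:
                raise SyntaxError(f"unknown encoding: {name}") from None
            # Returning here means a match on line 1 hides line 2.
            return name
    return None


print(declared_encoding(b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\n"))
print(declared_encoding(b"# vim: set fileencoding=utf-8 :\n"))
```

Because the pattern only requires `coding` followed by `:` or `=`, it also matches inside longer words such as `fileencoding=` or `encoding:`, which is what lets the editor-specific forms work.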
To aid with platforms such as Windows, which add Unicode BOM marks to the beginning of Unicode files, the UTF-8 signature \xef\xbb\xbf will be interpreted as ‘utf-8’ encoding as well (even if no magic encoding comment is given).
If a source file uses both the UTF-8 BOM mark signature and a magic encoding comment, the only allowed encoding for the comment is ‘utf-8’. Any other encoding will cause an error.
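The interaction between the UTF-8 signature and the magic comment can be illustrated as follows. This is a minimal sketch under stated assumptions: `source_encoding` is a hypothetical helper, the `"ascii"` default mirrors the default described earlier, and spellings of the declared name are normalized with `codecs.lookup`, which is a choice made for this example rather than something the text prescribes.

```python
import codecs
import re

CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def source_encoding(source: bytes, default: str = "ascii") -> str:
    """Resolve the encoding from a UTF-8 signature and/or a magic comment."""
    has_bom = source.startswith(codecs.BOM_UTF8)  # b"\xef\xbb\xbf"
    if has_bom:
        source = source[len(codecs.BOM_UTF8):]

    # Look for a declaration on the first two lines only.
    declared = None
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            declared = match.group(1).decode("ascii")
            break

    if has_bom:
        if declared is not None:
            try:
                normalized = codecs.lookup(declared).name
            except LookupError:
                normalized = None
            # With a signature present, the comment may only say utf-8.
            if normalized != "utf-8":
                raise SyntaxError(
                    f"UTF-8 signature conflicts with declared {declared!r}"
                )
        return "utf-8"
    return declared or default


print(source_encoding(b"\xef\xbb\xbfimport os\n"))
print(source_encoding(b"# coding: latin-1\nimport os\n"))
```

A file that starts with the signature but declares, say, `latin-1` is rejected rather than silently decoded with either encoding.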
These are some examples to clarify the different styles for defining the source code encoding at the top of a Python source file:
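With an interpreter line and an Emacs-style file encoding comment:
#!/usr/bin/python
# -*- coding: latin-1 -*-
import os, sys
...

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import os, sys
...

#!/usr/bin/python
# -*- coding: ascii -*-
import os, sys
...

Without an interpreter line, using plain text:
# This Python file uses the following encoding: utf-8
import os, sys
...

Text editors might have different ways of defining the file’s encoding, e.g.:
#!/usr/local/bin/python
# coding: latin-1
import os, sys
...

Without an encoding comment, Python’s parser will assume ASCII text:
#!/usr/local/bin/python
import os, sys
...

Encoding comments which don’t work:

Missing “coding:” prefix:
#!/usr/local/bin/python
# latin-1
import os, sys
...

Encoding comment not on line 1 or 2:
#!/usr/local/bin/python
#
# -*- coding: latin-1 -*-
import os, sys
...

Unsupported encoding:
#!/usr/local/bin/python
# -*- coding: utf-42 -*-
import os, sys
...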
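A few of the headers above can be checked against a current interpreter with the standard library function `tokenize.detect_encoding`. Note the assumptions: this sketch runs on Python 3, where the default for undeclared files is UTF-8 rather than the ASCII default this PEP describes, and where recognized names may be reported in a normalized spelling. The wrapper `detected` and the variable names are hypothetical.

```python
import io
import tokenize


def detected(source: bytes) -> str:
    """Ask the standard library which encoding it would use for `source`."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


emacs_style = b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\nimport os, sys\n"
no_prefix = b"#!/usr/local/bin/python\n# latin-1\nimport os, sys\n"
third_line = (
    b"#!/usr/local/bin/python\n#\n# -*- coding: latin-1 -*-\nimport os, sys\n"
)
unknown = b"#!/usr/local/bin/python\n# -*- coding: utf-42 -*-\nimport os, sys\n"

print(detected(emacs_style))  # a normalized name for Latin-1
print(detected(no_prefix))    # the default: no declaration was recognized
print(detected(third_line))   # the default: line 3 is never inspected

try:
    detected(unknown)
except SyntaxError as exc:
    print("rejected:", exc)
```

The two headers that "don't work" are not errors in themselves; the declaration is simply not seen, so the default encoding applies. Only the unknown encoding name is rejected outright.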
The PEP is based on the following concepts, which would have to be implemented to enable usage of such a magic comment:
Any encoding which allows processing the first two lines in the way indicated above is allowed as source code encoding; this includes ASCII-compatible encodings as well as certain multi-byte encodings such as Shift_JIS. It does not include encodings which use two or more bytes for all characters, such as UTF-16. The reason for this is to keep the encoding detection algorithm in the tokenizer simple.
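The restriction can be made concrete by encoding a magic comment in several encodings and testing the raw bytes against the declaration pattern. This is an illustrative sketch only; `declaration_visible` is a hypothetical helper, and the list of encodings is an arbitrary sample.

```python
import re

CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declaration_visible(encoding: str) -> bool:
    """Encode a magic comment with `encoding` and test the raw bytes."""
    line = f"# -*- coding: {encoding} -*-\n".encode(encoding)
    return CODING_RE.match(line) is not None


for name in ("ascii", "latin-1", "utf-8", "shift_jis", "utf-16"):
    print(name, declaration_visible(name))
```

In the ASCII-compatible encodings and in Shift_JIS, the characters of the comment keep their one-byte ASCII values, so a byte-level scan finds the declaration before the encoding is known. In UTF-16 every character occupies at least two bytes, so the same scan never sees the word `coding` as a contiguous byte sequence.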
Note that Python identifiers are restricted to the ASCII subset of the encoding, and thus need no further conversion after step 4.
For backwards compatibility with existing code which currently uses non-ASCII in string literals without declaring an encoding, the implementation will be introduced in two phases:
A warning will be issued if non-ASCII bytes are found in the input, once per improperly encoded input file.
The builtin compile() API will be enhanced to accept Unicode as input. 8-bit string input is subject to the standard procedure for encoding detection as described above.
If a Unicode string with a coding declaration is passed to compile(), a SyntaxError will be raised.
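As a rough illustration of encoding detection on 8-bit input, the following sketch passes source bytes to `compile()`. It assumes a Python 3 interpreter, where the closest equivalent of the "8-bit string" in the text is a `bytes` object; the file name `<example>` and the variable names are arbitrary. The `SyntaxError` described for Unicode input is not exercised here.

```python
# Source bytes written in Latin-1: the byte 0xE9 is "é" in that encoding.
source = b"# -*- coding: latin-1 -*-\ntext = '\xe9'\n"

# compile() inspects the magic comment and decodes the bytes accordingly.
code = compile(source, "<example>", "exec")

namespace = {}
exec(code, namespace)
print(namespace["text"] == "\u00e9")
```

The string literal ends up holding the character U+00E9, which shows that the declaration on the first line, not a fixed default, determined how the byte was interpreted.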
SUZUKI Hisao is working on a patch; see [2] for details. A patch implementing only phase 1 is available at [1].
Implementation of steps 1 and 2 above was completed in 2.3, except for changing the default encoding to “ascii”.
The default encoding was set to “ascii” in version 2.5.
This PEP intends to provide an upgrade path from the current (more-or-less) undefined source code encoding situation to a more robust and portable definition.
This document has been placed in the public domain.
Source: https://github.com/python/peps/blob/main/peps/pep-0263.rst
Last modified: 2025-02-01 08:55:40 GMT