PEP 672 – Unicode-related Security Considerations for Python

Author: Petr Viktorin <encukou at gmail.com>
Status:

Abstract

This document explains possible ways to misuse Unicode to write Python programs that appear to do something other than what they actually do.

This document does not give any recommendations or solutions.

Unicode is a system for handling all kinds of written language. It aims to allow any character from any human language to be used. Python code may consist of almost all valid Unicode characters. While this allows programmers from all around the world to express themselves, it also allows writing code that is potentially confusing to readers.

It is possible to misuse Python’s Unicode-related features to write code that appears to do something other than what it does. Evildoers could take advantage of this to trick code reviewers into accepting malicious code.

The possible issues generally can’t be solved in Python itself without excessive restrictions of the language. They should be solved in code editors and review tools (such as diff displays), by enforcing project-specific policies, and by raising awareness of individual programmers.

This document purposefully does not give any solutions or recommendations: it is rather a list of things to keep in mind.

This document is specific to Python. For general security considerations in Unicode text and source code, see Unicode technical reports [tr36], [tr39], and [tr55]. (Note that Python does not necessarily conform to these specifications.)

Acknowledgement

Investigation for this document was prompted by CVE-2021-42574, Trojan Source Attacks, reported by Nicholas Boucher and Ross Anderson, which focuses on Bidirectional override characters and homoglyphs in a variety of programming languages.

Confusing Features

This section lists some Unicode-related features that can be surprising or misusable.

ASCII-only Considerations

ASCII is a subset of Unicode, consisting of the most common symbols, numbers, Latin letters and control characters.

While issues with the ASCII character set are generally well understood, they’re presented here to aid understanding of the non-ASCII cases.

Confusables and Typos

Some characters look alike. Before the age of computers, many mechanical typewriters lacked the keys for the digits 0 and 1: users typed O (capital o) and l (lowercase L) instead. Human readers could tell them apart by context only. In programming languages, however, distinction between digits and letters is critical – and most fonts designed for programmers make it easy to tell them apart.

Similarly, in fonts designed for human languages, the uppercase “I” and lowercase “l” can look similar. Or the letters “rn” may be virtually indistinguishable from the single letter “m”. Again, programmers’ fonts make these pairs of confusables noticeably different.

However, what is “noticeably” different always depends on the context. Humans tend to ignore details in longer identifiers: the variable name accessibi1ity_options can still look indistinguishable from accessibility_options, while they are distinct for the compiler. The same can be said for plain typos: most humans will not notice the typo in responsbility_chain_delegate.
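
As a rough illustration of how such look-alike names might be spotted mechanically, the following sketch folds a few commonly confused ASCII characters together and compares the results. The helper ascii_skeleton and its folding table are hypothetical and deliberately tiny: they are not part of Python or of any Unicode specification, and they do nothing about plain typos.

```python
# Hypothetical folding table: map characters that are easy to confuse
# in many fonts onto one representative. This is an assumption made
# for illustration, not an exhaustive or standard list.
_FOLD = str.maketrans({"0": "O", "1": "l", "I": "l"})


def ascii_skeleton(name):
    """Return a crude "look-alike" key for an ASCII identifier."""
    # Fold single characters first, then the two-letter pair "rn" -> "m".
    return name.translate(_FOLD).replace("rn", "m")


first = "accessibi1ity_options"   # contains the digit 1
second = "accessibility_options"  # contains only letters

# The two names are distinct for Python but share one look-alike key.
names_differ = first != second
keys_match = ascii_skeleton(first) == ascii_skeleton(second)
```

Two names whose keys match are only candidates for a closer look; whether they are actually confusable still depends on the font and the reader.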

Control Characters

Python generally considers all CR (\r), LF (\n), and CR-LF pairs (\r\n) as end-of-line characters. Most code editors do as well, but there are editors that display “non-native” line endings as unknown characters (or nothing at all), rather than ending the line, displaying this example:

# Don't call this function:
fire_the_missiles()

as a harmless comment like:

# Don't call this function:⬛fire_the_missiles()

CPython may treat the control character NUL (\0) as end of input, but many editors simply skip it, possibly showing code that Python will not run as a regular part of a file.

Some characters can be used to hide/overwrite other characters when source is listed in common terminals. For example:

BS (\b, Backspace) moves the cursor back, so the character after it will overwrite the character before.
CR (\r, carriage return) moves the cursor to the start of line, subsequent characters overwrite the start of the line.
SUB (\x1A, Ctrl+Z) means “End of text” on Windows. Some programs (such as type) ignore the rest of the file after it.
ESC (\x1B) commonly initiates escape codes which allow arbitrary control of the terminal.
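
The control characters listed above can be located with a plain scan over the decoded source text. The sketch below is only an illustration: the function name find_control_chars and the choice of which characters to report are assumptions made here, not an established tool or policy.

```python
# Control characters discussed above, keyed by character.
# Which characters to report is an assumption made for this sketch.
SUSPICIOUS = {
    "\x00": "NUL",
    "\x08": "BS",
    "\x1a": "SUB",
    "\x1b": "ESC",
}


def find_control_chars(source):
    """Return (offset, label) pairs for suspicious control characters."""
    findings = []
    for offset, ch in enumerate(source):
        if ch in SUSPICIOUS:
            findings.append((offset, SUSPICIOUS[ch]))
        elif ch == "\r" and source[offset + 1:offset + 2] != "\n":
            # A CR that is not part of a CR-LF pair still ends the line
            # for Python, but some tools do not display it that way.
            findings.append((offset, "lone CR"))
    return findings


sample = "# Don't call this function:\rfire_the_missiles()\n"
report = find_control_chars(sample)
```

Here report holds a single entry for the lone CR that separates the comment from the call.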

Confusable Characters in Identifiers

Python is not limited to ASCII. It allows characters of all scripts – Latin letters to ancient Egyptian hieroglyphs – in identifiers (such as variable names). See PEP 3131 for details and rationale. Only “letters and numbers” are allowed, so while γάτα is a valid Python identifier, 🐱 is not. (See Identifiers and keywords for details.)

Non-printing control characters are also not allowed in identifiers.

However, within the allowed set there is a large number of “confusables”. For example, the uppercase versions of the Latin b, Greek β (Beta), and Cyrillic в (Ve) often look identical: B, Β and В, respectively.

This allows identifiers that look the same to humans, but not to Python. For example, all of the following are distinct identifiers:

scope (Latin, ASCII-only)
scоpe (with a Cyrillic о)
scοpe (with a Greek ο)
ѕсоре (all Cyrillic letters)
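
The four spellings above can be told apart programmatically by looking at the Unicode names of their characters. The sketch below uses the first word of each character name as a crude stand-in for its script. That shortcut and the helper name scripts_used are assumptions made for illustration; the unicodedata module does not expose the Unicode Script property itself.

```python
import unicodedata


def scripts_used(identifier):
    """Return the set of name prefixes (e.g. LATIN, GREEK) of the letters."""
    # The first word of the character name is only an approximation of
    # the script; it is good enough to separate the examples here.
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in identifier
        if ch.isalpha()
    }


# The four identifiers from the list, written with escapes so that the
# differences survive copy and paste.
candidates = [
    "scope",                           # Latin, ASCII-only
    "sc\u043epe",                      # with a Cyrillic o
    "sc\u03bfpe",                      # with a Greek omicron
    "\u0455\u0441\u043e\u0440\u0435",  # all Cyrillic letters
]

all_valid = all(name.isidentifier() for name in candidates)
distinct_count = len(set(candidates))
mixed = [name for name in candidates if len(scripts_used(name)) > 1]
```

Only the second and third spellings mix scripts; the all-Cyrillic one uses a single script and would pass a mixed-script check, which is one reason such checks cannot catch every case.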

Additionally, some letters can look like non-letters:

The letter for the Hawaiian ʻokina looks like an apostrophe; ʻHelloʻ is a Python identifier, not a string.
The East Asian word for ten looks like a plus sign, so 十=10 is a complete Python statement. (The “十” is a word: “ten” rather than “10”.)
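
Both cases can be checked directly. The short sketch below writes the characters as escapes (U+02BB for the ʻokina, U+5341 for 十) and runs the assignment in a scratch namespace; the variable names are chosen here for illustration only.

```python
import unicodedata

okina_name = "\u02bbHello\u02bb"  # looks like a quoted string
ten = "\u5341"                    # looks like a plus sign

# Both characters are classified as letters, so both spellings are
# valid identifiers.
okina_category = unicodedata.category("\u02bb")
ten_category = unicodedata.category(ten)
okina_is_identifier = okina_name.isidentifier()

# Run the one-line statement from the text in a scratch namespace.
namespace = {}
exec(ten + "=10", namespace)
assigned_value = namespace[ten]
```

After the exec call, the scratch namespace contains a variable named 十 bound to the integer 10.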

Note

The converse also applies – some symbols look like letters – but since Python does not allow arbitrary symbols in identifiers, this is not an issue.

Confusable Digits

Numeric literals in Python only use the ASCII digits 0-9 (and non-digits such as . or e).

However, when numbers are converted from strings, such as in the int and float constructors or by the str.format method, any decimal digit can be used. For example ߅ (NKO DIGIT FIVE) or ௫ (TAMIL DIGIT FIVE) work as the digit 5.

Some scripts include digits that look similar to ASCII ones, but have a different value. For example:

>>> int('৪୨')
42
>>> '{٥}'.format('zero', 'one', 'two', 'three', 'four', 'five')
'five'
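
To see which digits a string really contains, each character can be looked up in the Unicode database. The following sketch lists the non-ASCII decimal digits in a string together with their values; the helper name describe_digits is an assumption made for this example.

```python
import unicodedata


def describe_digits(text):
    """Return (character name, decimal value) for non-ASCII decimal digits."""
    return [
        (unicodedata.name(ch), unicodedata.decimal(ch))
        for ch in text
        if ch.isdecimal() and not ch.isascii()
    ]


# The digits from the examples above, written as escapes.
nko_five = "\u07c5"
tamil_five = "\u0beb"
looks_like_89 = "\u09ea\u0b68"  # the string passed to int() above

values = (int(nko_five), int(tamil_five), int(looks_like_89))
details = describe_digits(looks_like_89)
```

The details list names the two characters as a Bengali four and an Oriya two, which explains why int() returns 42 for them.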

Bidirectional Text

Some scripts, such as Hebrew or Arabic, are written right-to-left. Phrases in such scripts interact with nearby text in ways that can be surprising to people who aren’t familiar with these writing systems and their computer representation.

The exact process is complicated, and explained in Unicode Standard Annex #9, Unicode Bidirectional Algorithm.

Consider the following code, which assigns a 100-character string to the variable s:

s = "X" * 100 #    "X" is assigned

When the X is replaced by the Hebrew letter א, the line becomes:

s = "א" * 100 #    "א" is assigned

This command still assigns a 100-character string to s, but when displayed as general text following the Bidirectional Algorithm (e.g. in a browser), it appears as s = "א" followed by a comment.

Other surprising examples include:

In the statement ערך = 23, the variable ערך is set to the integer 23.
In the statement قيمة = ערך, the variable قيمة is set to the value of ערך.
In the statement قيمة - (ערך ** 2), the value of ערך is squared and then subtracted from قيمة. The opening parenthesis is displayed as ).
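
Whether a line is subject to this kind of reordering can be estimated from the bidirectional class of its characters. The sketch below reports whether a string contains any strongly right-to-left character (classes R and AL in the Bidirectional Algorithm); the helper name has_rtl is an assumption made here, and the check says nothing about how a particular tool will render the line.

```python
import unicodedata

# Strong right-to-left classes: R (e.g. Hebrew) and AL (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def has_rtl(text):
    """Return True if any character has a strong right-to-left class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


latin_line = 's = "X" * 100'
hebrew_line = 's = "\u05d0" * 100'  # the X replaced by the Hebrew letter alef

# Executing the Hebrew variant still builds a 100-character string,
# regardless of how the line is displayed.
namespace = {}
exec(hebrew_line, namespace)
length = len(namespace["s"])
```

A check like this only flags lines for closer inspection; right-to-left text in strings and identifiers is legitimate in many code bases.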

Bidirectional Marks, Embeddings, Overrides and Isolates

Default reordering rules do not always yield the intended direction of text, so Unicode provides several ways to alter it.

The most basic are directional marks, which are invisible but affect text as a left-to-right (or right-to-left) character would. Continuing with the s = "X" example above, in the next example the X is replaced by the Latin x followed or preceded by a right-to-left mark (U+200F). This assigns a 200-character string to s (100 copies of x interspersed with 100 invisible marks), but under Unicode rules for general text, it is rendered as s = "x" followed by an ASCII-only comment:

s="x‏"*100#    "‏x" is assigned

The directional embedding, override and isolate characters are also invisible, but affect the ordering of all text after them until either ended by a dedicated character, or until the end of line. (Unicode specifies the effect to last until the end of a “paragraph” (see Unicode Bidirectional Algorithm), but allows tools to interpret newline characters as paragraph ends (see Unicode Newline Guidelines). Most code editors and terminals do so.)

These characters essentially allow arbitrary reordering of the text that follows them. Python only allows them in strings and comments, which does limit their potential (especially in combination with the fact that Python’s comments always extend to the end of a line), but it doesn’t render them harmless.
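
Because these characters are invisible, the only reliable way to notice them is to search for their code points. The sketch below scans source text for the directional marks, embeddings, overrides and isolates; the function name find_bidi_controls is hypothetical, and the explicit list of code points is written out here as an assumption about what a reviewer would want reported.

```python
import unicodedata

# Marks (U+200E, U+200F, U+061C), embeddings and overrides
# (U+202A to U+202E), and isolates (U+2066 to U+2069).
BIDI_CONTROLS = frozenset(
    "\u200e\u200f\u061c"
    "\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
)


def find_bidi_controls(source):
    """Return (line number, column, character name) for each control found."""
    hits = []
    for lineno, line in enumerate(source.split("\n"), start=1):
        for column, ch in enumerate(line):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, column, unicodedata.name(ch)))
    return hits


# The right-to-left mark example, with the marks written as escapes.
sample = 's = "x\u200f" * 100 #    "\u200fx" is assigned\n'
hits = find_bidi_controls(sample)

# The string literal really is two code points long before repetition.
repeated_length = len("x\u200f" * 100)
```

For the sample line, both hits are right-to-left marks: one inside the string literal and one inside the comment.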

Normalizing identifiers

Python strings are collections of Unicode codepoints, not “characters”.

For reasons like compatibility with earlier encodings, Unicode often has several ways to encode what is essentially a single “character”. For example, all these are different ways of writing Å as a Python string, each of which is unequal to the others.

"\N{LATIN CAPITAL LETTER A WITH RING ABOVE}" (1 codepoint)
"\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}" (2 codepoints)
"\N{ANGSTROM SIGN}" (1 codepoint, but different)

For another example, the ligature ﬁ has a dedicated Unicode codepoint, even though it has the same meaning as the two letters fi.

Also, common letters frequently have several distinct variations. Unicode provides them for contexts where the difference has some semantic meaning, like mathematics. For example, some variations of n are:

n (LATIN SMALL LETTER N)
𝐧 (MATHEMATICAL BOLD SMALL N)
𝘯 (MATHEMATICAL SANS-SERIF ITALIC SMALL N)
ｎ (FULLWIDTH LATIN SMALL LETTER N)
ⁿ (SUPERSCRIPT LATIN SMALL LETTER N)

Unicode includes algorithms to normalize variants like these to a single form, and Python identifiers are normalized. (There are several normal forms; Python uses NFKC.)

For example, xn and xⁿ are the same identifier in Python:

>>> xⁿ = 8
>>> xn
8

… as are ﬁ and fi, and as are the different ways to encode Å.
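
The same normal form is available to programs through unicodedata.normalize. The sketch below applies NFKC to the examples from this section to show what the identifiers become; the helper name normalize_identifier is chosen here for illustration.

```python
import unicodedata


def normalize_identifier(name):
    """Apply the NFKC normal form, as Python does for identifiers."""
    return unicodedata.normalize("NFKC", name)


# The three ways of writing the letter A with ring above.
forms = [
    "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}",
    "\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}",
    "\N{ANGSTROM SIGN}",
]

distinct_as_strings = len(set(forms))
distinct_as_identifiers = len({normalize_identifier(f) for f in forms})

superscript_name = normalize_identifier("x\N{SUPERSCRIPT LATIN SMALL LETTER N}")
ligature_name = normalize_identifier("\N{LATIN SMALL LIGATURE FI}nalize")
```

The three strings compare unequal, yet all of them normalize to the same single codepoint, so they name the same identifier.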

This normalization applies only to identifiers, however. Functions that treat strings as identifiers, such as getattr, do not perform normalization:

>>> class Test:
...     def ﬁnalize(self):
...         print('OK')
...
>>> Test().finalize()
OK
>>> Test().ﬁnalize()
OK
>>> getattr(Test(), 'ﬁnalize')
Traceback (most recent call last):
  ...
AttributeError: 'Test' object has no attribute 'ﬁnalize'
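
The difference shows up when the attribute name arrives as a string. In the sketch below the lookup fails for the ligature spelling and succeeds once the caller applies NFKC explicitly; doing so is a choice made for this illustration, not something getattr does or is documented to do.

```python
import unicodedata


class Test:
    def finalize(self):
        return "OK"


obj = Test()
ligature_name = "\ufb01nalize"  # starts with the single-codepoint ligature

# The raw string is compared as-is, so the attribute is not found.
found_raw = hasattr(obj, ligature_name)

# Normalizing first yields the name under which the method is stored.
normalized = unicodedata.normalize("NFKC", ligature_name)
result = getattr(obj, normalized)()
```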

This also applies when importing:

import ﬁnalization performs normalization, and looks for a file named finalization.py (and other finalization.* files).
importlib.import_module("ﬁnalization") does not normalize, so it looks for a file named ﬁnalization.py.

Some filesystems independently apply normalization and/or case folding. On some systems, ﬁnalization.py, finalization.py and FINALIZATION.py are three distinct filenames; on others, some or all of these name the same file.

Source Encoding

The encoding of Python source files is given by a specific regex on the first two lines of a file, as per Encoding declarations. This mechanism is very liberal in what it accepts, and thus easy to obfuscate.

This can be misused in combination with Python-specific special-purpose encodings (see Text Encodings). For example, with encoding: unicode_escape, characters like quotes or braces can be hidden in an (f-)string, with many tools (syntax highlighters, linters, etc.) considering them part of the string. For example:

# For writing Japanese, you don't need an editor that supports
# UTF-8 source encoding: unicode_escape sequences work just as well.

import os

message = '''
This is "Hello World" in Japanese:
\u3053\u3093\u306b\u3061\u306f\u7f8e\u3057\u3044\u4e16\u754c

This runs `echo WHOA` in your shell:
\u0027\u0027\u0027\u002c\u0028\u006f\u0073\u002e\u0073\u0079\u0073\u0074\u0065\u006d\u0028\u0027\u0065\u0063\u0068\u006f\u0020\u0057\u0048\u004f\u0041\u0027\u0029\u0029\u002c\u0027\u0027\u0027
'''

Here, encoding: unicode_escape in the initial comment is an encoding declaration. The unicode_escape encoding instructs Python to treat \u0027 as a single quote (which can start/end a string), \u002c as a comma (punctuator), etc.
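
The declared encoding of a file can be inspected with tokenize.detect_encoding from the standard library, which reads the first two lines looking for a declaration. The sketch below wraps it in a helper named declared_encoding (a name assumed here) and decodes the start of the escape sequence from the example to show what Python sees.

```python
import io
import tokenize


def declared_encoding(source):
    """Return the source encoding that the first two lines declare."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# The two comment lines from the example above, as raw bytes.
header = (
    b"# For writing Japanese, you don't need an editor that supports\n"
    b"# UTF-8 source encoding: unicode_escape sequences work just as well.\n"
)

found = declared_encoding(header)
default = declared_encoding(b"x = 1\n")

# Decoding the first escapes of the hidden payload reveals the quotes
# and comma that close the string early.
payload_start = rb"\u0027\u0027\u0027\u002c".decode("unicode_escape")
```

A review tool could compare the detected encoding against a short list of expected values; which encodings to accept is a project-specific decision that this sketch does not make.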