Alyssa Coghlan's Python Notes

Processing Text Files in Python 3

A recent discussion on the python-ideas mailing list made it clear that we (i.e. the core Python developers) need to provide some clearer guidance on how to handle text processing tasks that trigger exceptions by default in Python 3, but were previously swept under the rug by Python 2’s blithe assumption that all files are encoded in “latin-1”.

While we’ll have something in the official docs before too long, this is my own preliminary attempt at summarising the options for processing text files, and the various trade-offs between them.

What changed in Python 3?

The obvious question to ask is what changed in Python 3 so that the common approaches that developers used for text processing in Python 2 have now started to throw UnicodeDecodeError and UnicodeEncodeError in Python 3.

The key difference is that the default text processing behaviour in Python 3 aims to detect text encoding problems as early as possible - either when reading improperly encoded text (indicated by UnicodeDecodeError) or when being asked to write out a text sequence that cannot be correctly represented in the target encoding (indicated by UnicodeEncodeError).

This contrasts with the Python 2 approach, which allowed data corruption by default and required strict correctness checks to be requested explicitly. That could certainly be convenient when the data being processed was predominantly ASCII text, and the occasional bit of data corruption was unlikely to be even detected, let alone cause problems, but it’s hardly a solid foundation for building robust multilingual applications (as anyone that has ever had to track down an errant UnicodeError in Python 2 will know).

However, Python 3 does provide a number of mechanisms for relaxing the default strict checks in order to handle various text processing use cases (in particular, use cases where “best effort” processing is acceptable, and strict correctness is not required). This article aims to explain some of them by looking at cases where it would be appropriate to use them.

Note that many of the features I discuss below are available in Python 2 as well, but you have to explicitly access them via the unicode type and the codecs module. In Python 3, they’re part of the behaviour of the str type and the open builtin.

Unicode Basics

To process text effectively in Python 3, it’s necessary to learn at least atiny amount about Unicode and text encodings:

  1. Python 3 always stores text strings as sequences of Unicode code points. These are values in the range 0-0x10FFFF. They don’t always correspond directly to the characters you read on your screen, but that distinction doesn’t matter for most text manipulation tasks.

  2. To store text as binary data, you must specify an encoding for that text.

  3. The process of converting from a sequence of bytes (i.e. binary data) to a sequence of code points (i.e. text data) is decoding, while the reverse process is encoding.

  4. For historical reasons, the most widely used encoding is ascii, which can only handle Unicode code points in the range 0-0x7F (i.e. ASCII is a 7-bit encoding).

  5. There are a wide variety of ASCII compatible encodings, which ensure that any appearance of a valid ASCII value in the binary data refers to the corresponding ASCII character.

  6. “utf-8” is becoming the preferred encoding for many applications, as it is an ASCII-compatible encoding that can encode any valid Unicode code point.

  7. “latin-1” is another significant ASCII-compatible encoding, as it maps byte values directly to the first 256 Unicode code points. (Note that Windows has its own “latin-1” variant called cp1252, but, unlike the ISO “latin-1” implemented by the Python codec with that name, the Windows-specific variant doesn’t map all 256 possible byte values.)

  8. There are also many ASCII incompatible encodings in widespread use, particularly in Asian countries (which had to devise their own solutions before the rise of Unicode) and on platforms such as Windows, Java and the .NET CLR, where many APIs accept text as UTF-16 encoded data.

  9. The locale.getpreferredencoding() call reports the encoding that Python will use by default for most operations that require an encoding (e.g. reading in a text file without a specified encoding). This is designed to aid interoperability between Python and the host operating system, but can cause problems with interoperability between systems (if encoding issues are not managed consistently).

  10. The sys.getfilesystemencoding() call reports the encoding that Python will use by default for most operations that both require an encoding and involve textual metadata in the filesystem (e.g. determining the results of os.listdir()).

  11. If you’re a native English speaker residing in an English speaking country (like me!) it’s tempting to think “but Python 2 works fine, why are you bothering me with all this Unicode malarkey?”. It’s worth trying to remember that we’re actually a minority on this planet and, for most people on Earth, ASCII and latin-1 can’t even handle their name, let alone any other text they might want to write or process in their native language.
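The encoding and decoding terminology above can be sketched in a few lines of Python (a minimal illustration, using the euro sign as an example of a non-ASCII code point):

```python
# str holds Unicode code points; bytes holds encoded binary data.
text = "Alyssa"
data = text.encode("utf-8")           # str -> bytes (encoding)
assert data.decode("utf-8") == text   # bytes -> str (decoding)

# Non-ASCII text shows why the encoding matters: a single code point
# may occupy several bytes in an ASCII-compatible encoding like UTF-8.
euro = "€"
print(len(euro))              # 1 code point
print(euro.encode("utf-8"))  # b'\xe2\x82\xac' - three bytes
```

Note that the same code point produces entirely different byte sequences under different encodings, which is why binary data is meaningless as text until an encoding is specified.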

Unicode Error Handlers

To help standardise various techniques for dealing with Unicode encoding anddecoding errors, Python includes a concept of Unicode error handlers thatare automatically invoked whenever a problem is encountered in the processof encoding or decoding text.

I’m not going to cover all of them in this article, but three are ofparticular significance:

  • strict: this is the default error handler that just raises UnicodeDecodeError for decoding problems and UnicodeEncodeError for encoding problems.

  • surrogateescape: this is the error handler that Python uses for most OS facing APIs to gracefully cope with encoding problems in the data supplied by the OS. It handles decoding errors by squirreling the data away in a little used part of the Unicode code point space. (For those interested in more detail, see PEP 383.) When encoding, it translates those hidden away values back into the exact original byte sequence that failed to decode correctly. Just as this is useful for OS APIs, it can make it easier to gracefully handle encoding problems in other contexts.

  • backslashreplace: this is an encoding error handler that converts code points that can’t be represented in the target encoding to the equivalent Python string numeric escape sequence. It makes it easy to ensure that UnicodeEncodeError will never be thrown, but doesn’t lose much information while doing so. (Since we don’t want encoding problems hiding error output, this error handler is enabled on sys.stderr by default.)
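The behaviour of these three handlers can be sketched with a byte sequence that is deliberately invalid as UTF-8 (the 0xfe byte never occurs in well-formed UTF-8 data):

```python
raw = b"abc\xfedef"

# strict (the default): decoding problems raise UnicodeDecodeError
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict rejected the data:", exc.reason)

# surrogateescape: the bad byte is smuggled through as a lone surrogate
# code point, and encoding with the same handler restores the exact
# original byte sequence
text = raw.decode("utf-8", errors="surrogateescape")
assert text.encode("utf-8", errors="surrogateescape") == raw

# backslashreplace (an encoding handler): unencodable code points become
# Python numeric escape sequences instead of raising UnicodeEncodeError
print("caf\u00e9".encode("ascii", errors="backslashreplace"))  # b'caf\\xe9'
```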

The Binary Option

One alternative that is always available is to open files in binary mode and process them as bytes rather than as text. This can work in many cases, especially those where the ASCII markers are embedded in genuinely arbitrary binary data.

However, for both “text data with unknown encoding” and “text data with known encoding, but potentially containing encoding errors”, it is often preferable to get them into a form that can be handled as text strings. In particular, some APIs that accept both bytes and text may be very strict about the encoding of the bytes they accept (for example, the urllib.parse module accepts only pure ASCII data for processing as bytes, but will happily process text strings containing non-ASCII code points).
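As a rough sketch of the binary option, ASCII markers can be located in bytes data without ever decoding it, since the bytes type supports the same search and split methods as str (the HTTP-style record below is purely illustrative):

```python
# Scanning for an ASCII marker in binary data: bytes literals and
# bytes methods mirror the str API, so no decoding step is needed.
record = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n"

for line in record.split(b"\r\n"):
    if line.startswith(b"Host:"):
        # Everything stays as bytes from input to output
        print(line.split(b":", 1)[1].strip())  # b'example.com'
```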

Text File Processing

This section explores a number of use cases that can arise when processing text. Text encoding is a sufficiently complex topic that there’s no one size fits all answer - the right answer for a given application will depend on factors like:

  • how much control you have over the text encodings used

  • whether avoiding program failure is more important than avoiding data corruption or vice-versa

  • how common encoding errors are expected to be, and whether they need to be handled gracefully or can simply be rejected as invalid input

Files in an ASCII compatible encoding, best effort is acceptable

Use case: the files to be processed are in an ASCII compatible encoding, but you don’t know exactly which one. All files must be processed without triggering any exceptions, but some risk of data corruption is deemed acceptable (e.g. collating log files from multiple sources where some data errors are acceptable, so long as the logs remain largely intact).

Approach: use the “latin-1” encoding to map byte values directly to the first 256 Unicode code points. This is the closest equivalent Python 3 offers to the permissive Python 2 text handling model.

Example: f = open(fname, encoding="latin-1")

Note

While the Windows cp1252 encoding is also sometimes referred to as “latin-1”, it doesn’t map all possible byte values, and thus needs to be used in combination with the surrogateescape error handler to ensure it never throws UnicodeDecodeError. The latin-1 encoding in Python implements ISO_8859-1:1987, which maps all possible byte values to the first 256 Unicode code points, and thus ensures decoding errors will never occur regardless of the configured error handler.

Consequences:

  • data will not be corrupted if it is simply read in, processed as ASCII text, and written back out again.

  • will never raise UnicodeDecodeError when reading data

  • will still raise UnicodeEncodeError if code points above 0xFF (e.g. smart quotes copied from a word processing program) are added to the text string before it is encoded back to bytes. To prevent such errors, use the backslashreplace error handler (or one of the other error handlers that replaces Unicode code points without a representation in the target encoding with sequences of ASCII code points).

  • data corruption may occur if the source data is in an ASCII incompatible encoding (e.g. UTF-16)

  • corruption may occur if data is written back out using an encoding other than latin-1

  • corruption may occur if the non-ASCII elements of the string are modified directly (e.g. for a variable width encoding like UTF-8 that has been decoded as latin-1 instead, slicing the string at an arbitrary point may split a multi-byte character into two pieces)
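The “no corruption on a pure round trip” property can be checked directly. The sketch below uses a temporary file purely for illustration; note that newline="" is passed to suppress newline translation, which would otherwise alter any \r bytes in the data:

```python
import os
import tempfile

# Every possible byte value maps to a code point below 0x100 in latin-1,
# so a read/process-as-ASCII/write cycle preserves the data exactly.
payload = bytes(range(256))

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    fname = tmp.name

with open(fname, encoding="latin-1", newline="") as f:
    text = f.read()          # decode: one code point per input byte

with open(fname, "w", encoding="latin-1", newline="") as f:
    f.write(text)            # encode: one output byte per code point

with open(fname, "rb") as f:
    print(f.read() == payload)  # True - no corruption on a pure round trip

os.unlink(fname)
```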

Files in an ASCII compatible encoding, minimise risk of data corruption

Use case: the files to be processed are in an ASCII compatible encoding, but you don’t know exactly which one. All files must be processed without triggering any exceptions, but some Unicode related errors are acceptable in order to reduce the risk of data corruption (e.g. collating log files from multiple sources, but wanting more explicit notification when the collated data is at risk of corruption due to programming errors that violate the assumption of writing the data back out only in its original encoding).

Approach: use the ascii encoding with the surrogateescape error handler.

Example: f = open(fname, encoding="ascii", errors="surrogateescape")

Consequences:

  • data will not be corrupted if it is simply read in, processed as ASCII text, and written back out again.

  • will never raise UnicodeDecodeError when reading data

  • will still raise UnicodeEncodeError if code points above 0x7F (e.g. smart quotes copied from a word processing program) are added to the text string before it is encoded back to bytes. To prevent such errors, use the backslashreplace error handler (or one of the other error handlers that replaces Unicode code points without a representation in the target encoding with sequences of ASCII code points).

  • will also raise UnicodeEncodeError if an attempt is made to encode a text string containing escaped byte values without enabling the surrogateescape error handler (or an even more tolerant handler like backslashreplace).

  • some Unicode processing libraries that ensure a code point sequence is valid text may complain about the escaping mechanism used (I’m not going to explain what it means here, but the phrase “lone surrogate” is a hint that something along those lines may be happening - the fact that “surrogate” also appears in the name of the error handler is not a coincidence).

  • data corruption may still occur if the source data is in an ASCII incompatible encoding (e.g. UTF-16)

  • data corruption is also still possible if the escaped portions of the string are modified directly
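A small sketch of this approach (the log line below is invented for illustration; the 0xe9 byte stands in for stray latin-1 data in an otherwise ASCII stream):

```python
raw = b"2024-01-01 caf\xe9 opened"

# Undecodable bytes are smuggled through as lone surrogate code points,
# so the ASCII portions can be processed as normal text
text = raw.decode("ascii", errors="surrogateescape")
print("opened" in text)  # True

# Writing back with the same encoding and handler restores the exact input
print(text.encode("ascii", errors="surrogateescape") == raw)  # True

# Encoding without the handler raises UnicodeEncodeError, flagging the
# risk of corruption instead of silently mangling the data
try:
    text.encode("ascii")
except UnicodeEncodeError:
    print("strict encoding rejects the escaped byte")
```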

Files in a typical platform specific encoding

Use case: the files to be processed are in a consistent encoding, the encoding can be determined from the OS details and locale settings, and it is acceptable to refuse to process files that are not properly encoded.

Approach: simply open the file in text mode. This use case describes the default behaviour in Python 3.

Example: f = open(fname)

Consequences:

  • UnicodeDecodeError may be thrown when reading such files (if the data is not actually in the encoding returned by locale.getpreferredencoding())

  • UnicodeEncodeError may be thrown when writing such files (if attempting to write out code points which have no representation in the target encoding)

  • the surrogateescape error handler can be used to be more tolerant of encoding errors if it is necessary to make a best effort attempt to process files that contain such errors instead of rejecting them outright as invalid input

Files in a consistent, known encoding

Use case: the files to be processed are nominally in a consistent encoding, you know the exact encoding in advance, and it is acceptable to refuse to process files that are not properly encoded. This is becoming more and more common, especially with many text file formats beginning to standardise on UTF-8 as the preferred text encoding.

Approach: open the file in text mode with the appropriate encoding

Example: f = open(fname, encoding="utf-8")

Consequences:

  • UnicodeDecodeError may be thrown when reading such files (if the data is not actually in the specified encoding)

  • UnicodeEncodeError may be thrown when writing such files (if attempting to write out code points which have no representation in the target encoding)

  • the surrogateescape error handler can be used to be more tolerant of encoding errors if it is necessary to make a best effort attempt to process files that contain such errors instead of rejecting them outright as invalid input
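A minimal sketch of the strict rejection this approach provides, using invented latin-1 bytes as the improperly encoded input:

```python
bad_utf8 = b"caf\xe9"  # latin-1 encoded bytes, not valid UTF-8

# With a known encoding and the default strict handler, improperly
# encoded input fails immediately rather than being silently corrupted
try:
    bad_utf8.decode("utf-8")
except UnicodeDecodeError:
    print("rejected as invalid input")

# The surrogateescape handler allows a best effort attempt instead:
# the undecodable byte survives as a lone surrogate code point
print(bad_utf8.decode("utf-8", errors="surrogateescape"))
```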

Files with a reliable encoding marker

Use case: the files to be processed include markers that specify the nominal encoding (with a default encoding assumed if no marker is present), and it is acceptable to refuse to process files that are not properly encoded.

Approach: first open the file in binary mode to look for the encoding marker, then reopen in text mode with the identified encoding.

Example: f = tokenize.open(fname) uses PEP 263 encoding markers to detect the encoding of Python source files (defaulting to UTF-8 if no encoding marker is detected)

Consequences:

  • can handle files in different encodings

  • may still raise UnicodeDecodeError if the encoding marker is incorrect

  • must ensure marker is set correctly when writing such files

  • even if it is not the default encoding, individual files can still be set to use UTF-8 as the encoding in order to support encoding almost all Unicode code points

  • the surrogateescape error handler can be used to be more tolerant of encoding errors if it is necessary to make a best effort attempt to process files that contain such errors instead of rejecting them outright as invalid input
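A short sketch of tokenize.open in action (the generated source file below is purely illustrative):

```python
import os
import tempfile
import tokenize

# A PEP 263 coding declaration in the first two lines of a Python source
# file selects the decoder used for the whole file
source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"

with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
    tmp.write(source)
    fname = tmp.name

# tokenize.open() reads the marker in binary mode, then reopens the
# file in text mode with the detected encoding (here, latin-1)
with tokenize.open(fname) as f:
    print("caf\u00e9" in f.read())  # True

os.unlink(fname)
```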
