PEP 293 – Codec Error Handling Callbacks

Author: Walter Dörwald <walter at livinglogic.de>
Status:

Abstract

This PEP aims at extending Python’s fixed codec error handling schemes with a more flexible callback-based approach.

Python currently uses a fixed error handling for codec error handlers. This PEP describes a mechanism which allows Python to use function callbacks as error handlers. With these more flexible error handlers it is possible to add new functionality to existing codecs by e.g. providing fallback solutions or different encodings for cases where the standard codec mapping does not apply.

Specification

Currently the set of codec error handling algorithms is fixed to either “strict”, “replace” or “ignore” and the semantics of these algorithms is implemented separately for each codec.

The proposed patch will make the set of error handling algorithms extensible through a codec error handler registry which maps handler names to handler functions. This registry consists of the following two C functions:

    int PyCodec_RegisterError(const char *name, PyObject *error)
    PyObject *PyCodec_LookupError(const char *name)

and their Python counterparts:

    codecs.register_error(name, error)
    codecs.lookup_error(name)

PyCodec_LookupError raises a LookupError if no callback function has been registered under this name.

Similar to the encoding name registry there is no way of unregistering callback functions or iterating through the available functions.
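The Python-level registry described above can be exercised roughly as follows. This is a minimal sketch written for Python 3, where text strings are already Unicode; the handler names `example.question` and `example.not-registered` and the function `question_mark` are made up for illustration.

```python
import codecs


def question_mark(exc):
    """Hypothetical handler: replace each unencodable character with '?'."""
    if isinstance(exc, UnicodeEncodeError):
        return ("?" * (exc.end - exc.start), exc.end)
    raise TypeError("can't handle %s" % type(exc).__name__)


# Register the callback under a (made-up) name, then look it up again.
codecs.register_error("example.question", question_mark)
handler = codecs.lookup_error("example.question")

# The registered name can now be passed as the errors argument.
encoded = "a\xe4b".encode("ascii", "example.question")

# Looking up a name that was never registered raises LookupError.
try:
    codecs.lookup_error("example.not-registered")
    lookup_failed = False
except LookupError:
    lookup_failed = True
```

Because the registry is global and offers no way to unregister, a namespaced handler name (as assumed here) helps avoid collisions.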

The callback functions will be used in the following way by the codecs: when the codec encounters an encoding/decoding error, the callback function is looked up by name, the information about the error is stored in an exception object and the callback is called with this object. The callback returns information about how to proceed (or raises an exception).

For encoding, the exception object will look like this:

    class UnicodeEncodeError(UnicodeError):
        def __init__(self, encoding, object, start, end, reason):
            # Parenthesised so that % applies to the whole message.
            UnicodeError.__init__(self,
                ("encoding '%s' can't encode characters "
                 "in positions %d-%d: %s") % (encoding,
                    start, end-1, reason))
            self.encoding = encoding
            self.object = object
            self.start = start
            self.end = end
            self.reason = reason

This type will be implemented in C with the appropriate setter and getter methods for the attributes, which have the following meaning:

encoding: The name of the encoding;
object: The original unicode object for which encode() has been called;
start: The position of the first unencodable character;
end: (The position of the last unencodable character)+1 (or the length of object, if all characters from start to the end of object are unencodable);
reason: The reason why object[start:end] couldn’t be encoded.

If object has consecutive unencodable characters, the encoder should collect those characters for one call to the callback if those characters can’t be encoded for the same reason. The encoder is not required to implement this behaviour and may instead call the callback for every single character, but it is strongly suggested that the collecting method is implemented.
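The attributes listed above can be inspected on the exception that a strict encode raises. The following is a small Python 3 sketch; it relies on the ASCII codec in CPython collecting the two adjacent unencodable characters into a single error, which is the suggested but not required behaviour.

```python
try:
    "a\xe4\xf6b".encode("ascii")
except UnicodeEncodeError as exc:
    # Copy the attributes out; the name exc is unbound after the except block.
    info = (exc.encoding, exc.object, exc.start, exc.end)
    unencodable = exc.object[exc.start:exc.end]
```

Here `start` is the index of the first unencodable character and `end` is one past the last, so `object[start:end]` is exactly the offending slice.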

The callback must not modify the exception object. If the callback does not raise an exception (either the one passed in, or a different one), it must return a tuple:

    (replacement, newpos)

replacement is a unicode object that the encoder will encode and emit instead of the unencodable object[start:end] part; newpos specifies a new position within object, where (after encoding the replacement) the encoder will continue encoding.

Negative values for newpos are treated as being relative to the end of object. If newpos is out of bounds the encoder will raise an IndexError.
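The role of newpos can be illustrated with two deliberately unusual handlers. This is a Python 3 sketch under the assumption that the ASCII codec follows the rules above; both handler functions and their registered names are hypothetical.

```python
import codecs


def jump_to_last(exc):
    """Hypothetical handler: emit '?' and resume at the last character.

    A newpos of -1 is interpreted relative to the end of exc.object.
    """
    if isinstance(exc, UnicodeEncodeError):
        return ("?", -1)
    raise TypeError("can't handle %s" % type(exc).__name__)


def bad_position(exc):
    """Hypothetical handler that returns a newpos past the end of the input."""
    return ("?", len(exc.object) + 1)


codecs.register_error("example.jump_to_last", jump_to_last)
codecs.register_error("example.bad_position", bad_position)

# The error is at index 2; encoding resumes at index 4, so "c" is skipped.
skipped = "ab\xe4cd".encode("ascii", "example.jump_to_last")

# An out-of-bounds newpos makes the encoder raise IndexError.
try:
    "ab\xe4cd".encode("ascii", "example.bad_position")
    raised_index_error = False
except IndexError:
    raised_index_error = True
```

Most handlers simply return `exc.end` as newpos; returning anything else, as done here, skips or re-reads input and should be used with care.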

If the replacement string itself contains an unencodable character the encoder raises the exception object (but may set a different reason string before raising).

Should further encoding errors occur, the encoder is allowed to reuse the exception object for the next call to the callback. Furthermore, the encoder is allowed to cache the result of codecs.lookup_error.

If the callback does not know how to handle the exception, it must raise a TypeError.
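Two of the rules above can be observed directly in Python 3. In this sketch the handler `still_unencodable` and its registered name are hypothetical; `codecs.xmlcharrefreplace_errors` is the name under which CPython exposes the built-in "xmlcharrefreplace" handler.

```python
import codecs


def still_unencodable(exc):
    """Hypothetical handler whose replacement is itself not ASCII."""
    return ("\u20ac", exc.end)


codecs.register_error("example.still_unencodable", still_unencodable)

# The replacement cannot be encoded either, so the encoder raises.
try:
    "a\xe4b".encode("ascii", "example.still_unencodable")
    outcome = "encoded"
except UnicodeEncodeError:
    outcome = "UnicodeEncodeError"

# An encode-only handler refuses a decoding error with TypeError.
decode_error = UnicodeDecodeError("ascii", b"\xff", 0, 1, "demo")
try:
    codecs.xmlcharrefreplace_errors(decode_error)
    refused = False
except TypeError:
    refused = True
```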

Decoding works similarly to encoding, with the following differences:

The exception class is named UnicodeDecodeError and the attribute object is the original 8bit string that the decoder is currently decoding.
The decoder will call the callback with those bytes that constitute one undecodable sequence, even if there is more than one undecodable sequence that is undecodable for the same reason directly after the first one. E.g. for the “unicode-escape” encoding, when decoding the illegal string \u00\u01x, the callback will be called twice (once for \u00 and once for \u01). This is done to be able to generate the correct number of replacement characters.
The replacement returned from the callback is a unicode object that will be emitted by the decoder as-is without further processing instead of the undecodable object[start:end] part.
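The one-call-per-undecodable-sequence rule can be seen with a handler that records its invocations. This is a Python 3 sketch using UTF-8 rather than “unicode-escape”; the handler, its registered name and the module-level `calls` list are hypothetical, and the list exists only to make the calls visible (a real handler should stay stateless).

```python
import codecs

calls = []  # demonstration only: records (start, end, bytes) per call


def record_and_replace(exc):
    """Hypothetical handler: log the undecodable span, emit U+FFFD."""
    if isinstance(exc, UnicodeDecodeError):
        calls.append((exc.start, exc.end, exc.object[exc.start:exc.end]))
        return ("\ufffd", exc.end)
    raise TypeError("can't handle %s" % type(exc).__name__)


codecs.register_error("example.record", record_and_replace)

# 0xff and 0xfe are each an invalid UTF-8 start byte.
decoded = b"a\xff\xfeb".decode("utf-8", "example.record")
```

The two adjacent invalid bytes produce two separate calls, which is what yields two replacement characters in the result.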

There is a third API that uses the old strict/ignore/replace error handling scheme:

    PyUnicode_TranslateCharmap/unicode.translate

The proposed patch will enhance PyUnicode_TranslateCharmap, so that it also supports the callback registry. This has the additional side effect that PyUnicode_TranslateCharmap will support multi-character replacement strings (see SF feature request #403100 [1]).

For PyUnicode_TranslateCharmap the exception class will be named UnicodeTranslateError. PyUnicode_TranslateCharmap will collect all consecutive untranslatable characters (i.e. those that map to None) and call the callback with them. The replacement returned from the callback is a unicode object that will be put in the translated result as-is, without further processing.

All encoders and decoders are allowed to implement the callback functionality themselves, if they recognize the callback name (i.e. if it is a system callback like “strict”, “replace” and “ignore”). The proposed patch will add two additional system callback names: “backslashreplace” and “xmlcharrefreplace”, which can be used for encoding and translating and which will also be implemented in-place for all encoders and PyUnicode_TranslateCharmap.

The Python equivalent of these five callbacks will look like this:

    def strict(exc):
        raise exc

    def ignore(exc):
        if isinstance(exc, UnicodeError):
            return (u"", exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

    def replace(exc):
        if isinstance(exc, UnicodeEncodeError):
            return ((exc.end-exc.start)*u"?", exc.end)
        elif isinstance(exc, UnicodeDecodeError):
            return (u"\ufffd", exc.end)
        elif isinstance(exc, UnicodeTranslateError):
            return ((exc.end-exc.start)*u"\ufffd", exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

    def backslashreplace(exc):
        if isinstance(exc,
            (UnicodeEncodeError, UnicodeTranslateError)):
            s = u""
            for c in exc.object[exc.start:exc.end]:
                if ord(c) <= 0xff:
                    s += u"\\x%02x" % ord(c)
                elif ord(c) <= 0xffff:
                    s += u"\\u%04x" % ord(c)
                else:
                    s += u"\\U%08x" % ord(c)
            return (s, exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

    def xmlcharrefreplace(exc):
        if isinstance(exc,
            (UnicodeEncodeError, UnicodeTranslateError)):
            s = u""
            for c in exc.object[exc.start:exc.end]:
                s += u"&#%d;" % ord(c)
            return (s, exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

These five callback handlers will also be accessible to Python as codecs.strict_error, codecs.ignore_error, codecs.replace_error, codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.
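The effect of the system callbacks on an encoder can be compared side by side. The sketch below is written for Python 3 and uses the ASCII codec; note that CPython as shipped spells the module-level handler names with a trailing `s` (for example `codecs.xmlcharrefreplace_errors`), which differs from the spelling given in the text.

```python
import codecs

text = "\xe4a\u20ac"  # a-umlaut, "a", euro sign

as_xml = text.encode("ascii", "xmlcharrefreplace")
as_backslash = text.encode("ascii", "backslashreplace")
as_replace = text.encode("ascii", "replace")
as_ignore = text.encode("ascii", "ignore")

# The handlers are plain callables and can also be invoked directly
# with a hand-built exception object.
demo_error = UnicodeEncodeError("ascii", "\xe4", 0, 1, "demo")
direct = codecs.xmlcharrefreplace_errors(demo_error)
```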

Rationale

Most legacy encodings do not support the full range of Unicode characters. For these cases many high level protocols support a way of escaping a Unicode character (e.g. Python itself supports the \x, \u and \U convention, XML supports character references via &#xxx; etc.).

When implementing such an encoding algorithm, a problem with the current implementation of the encode method of Unicode objects becomes apparent: For determining which characters are unencodable by a certain encoding, every single character has to be tried, because encode does not provide any information about the location of the error(s), so

    # (1)
    us = u"xxx"
    s = us.encode(encoding)

has to be replaced by

    # (2)
    us = u"xxx"
    v = []
    for c in us:
        try:
            v.append(c.encode(encoding))
        except UnicodeError:
            v.append("&#%d;" % ord(c))
    s = "".join(v)

This slows down encoding dramatically as now the loop through the string is done in Python code and no longer in C code.

Furthermore, this solution poses problems with stateful encodings. For example, UTF-16 uses a Byte Order Mark at the start of the encoded byte string to specify the byte order. Using (2) with UTF-16 results in an 8 bit string with a BOM between every character.
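The BOM problem is easy to reproduce. The following Python 3 sketch encodes a two-character string piecewise, in the manner of (2), and counts how often the native-order BOM (`codecs.BOM_UTF16`) appears compared with encoding the string in one call.

```python
import codecs

text = "ab"

# Character-by-character encoding, as in approach (2).
piecewise = b"".join(c.encode("utf-16") for c in text)

# Encoding the whole string at once emits a single BOM.
whole = text.encode("utf-16")

bom_count_piecewise = piecewise.count(codecs.BOM_UTF16)
bom_count_whole = whole.count(codecs.BOM_UTF16)
```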

To work around this problem, a stream writer - which keeps state between calls to the encoding function - has to be used:

    # (3)
    us = u"xxx"
    import codecs, cStringIO as StringIO
    writer = codecs.getwriter(encoding)

    v = StringIO.StringIO()
    uv = writer(v)
    for c in us:
        try:
            uv.write(c)
        except UnicodeError:
            uv.write(u"&#%d;" % ord(c))
    s = v.getvalue()

To compare the speed of (1) and (3) the following test script has been used:

    # (4)
    import time
    us = u"äa"*1000000
    encoding = "ascii"
    import codecs, cStringIO as StringIO

    t1 = time.time()

    s1 = us.encode(encoding, "replace")

    t2 = time.time()

    writer = codecs.getwriter(encoding)

    v = StringIO.StringIO()
    uv = writer(v)
    for c in us:
        try:
            uv.write(c)
        except UnicodeError:
            uv.write(u"?")
    s2 = v.getvalue()

    t3 = time.time()

    assert(s1==s2)
    print "1:", t2-t1
    print "2:", t3-t2
    print "factor:", (t3-t2)/(t2-t1)

On Linux this gives the following output (with Python 2.3a0):

    1: 0.274321913719
    2: 51.1284689903
    factor: 186.381278466

i.e. (3) is roughly 186 times slower than (1).
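With a named error handler, the per-character loop collapses into a single encode call that stays inside the codec. The Python 3 sketch below checks only that both approaches produce the same bytes for a short input; it makes no timing claim, and the input size is an arbitrary choice.

```python
us = "\xe4a" * 1000
encoding = "ascii"

# Approach (2): try every character individually in Python code.
parts = []
for c in us:
    try:
        parts.append(c.encode(encoding))
    except UnicodeError:
        parts.append(("&#%d;" % ord(c)).encode(encoding))
looped = b"".join(parts)

# Callback approach: the codec handles unencodable characters itself.
single_call = us.encode(encoding, "xmlcharrefreplace")
```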

Callbacks must be stateless, because as soon as a callback is registered it is available globally and can be called by multiple encode() calls. To be able to use stateful callbacks, the errors parameter for encode/decode/translate would have to be changed from char* to PyObject*, so that the callback could be used directly, without the need to register the callback globally. As this requires changes to lots of C prototypes, this approach was rejected.

Currently all encoding/decoding functions have arguments

    const Py_UNICODE *p, int size

or

    const char *p, int size

to specify the unicode characters/8bit characters to be encoded/decoded. So in case of an error the codec has to create a new unicode or str object from these parameters and store it in the exception object. The callers of these encoding/decoding functions extract these parameters from str/unicode objects themselves most of the time, so it could speed up error handling if these objects were passed directly. As this again requires changes to many C functions, this approach has been rejected.

For stream readers/writers the errors attribute must be changeable to be able to switch between different error handling methods during the lifetime of the stream reader/writer. This is currently the case for codecs.StreamReader and codecs.StreamWriter and all their subclasses. All core codecs and probably most of the third party codecs (e.g. JapaneseCodecs) derive their stream readers/writers from these classes so this already works, but the attribute errors should be documented as a requirement.
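Switching the error handling of a stream writer during its lifetime might look like this. The sketch is for Python 3, uses an in-memory `io.BytesIO` as the underlying stream, and assumes the writer consults its `errors` attribute on every `write` call, as `codecs.StreamWriter` does.

```python
import codecs
import io

buffer = io.BytesIO()
writer = codecs.getwriter("ascii")(buffer, errors="replace")

writer.write("\xe4")                  # handled by "replace"
writer.errors = "xmlcharrefreplace"   # switch handlers mid-stream
writer.write("\xe4")                  # handled by "xmlcharrefreplace"

result = buffer.getvalue()
```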

Implementation Notes

A sample implementation is available as SourceForge patch #432401 [2] including a script for testing the speed of various string/encoding/error combinations and a test script.

Currently the new exception classes are old style Python classes. This means that accessing attributes results in a dict lookup. The C API is implemented in a way that makes it possible to switch to new style classes behind the scenes, if Exception (and UnicodeError) will be changed to new style classes implemented in C for improved performance.

The class codecs.StreamReaderWriter uses the errors parameter for both reading and writing. To be more flexible this should probably be changed to two separate parameters for reading and writing.

The errors parameter of PyUnicode_TranslateCharmap is not available to Python, which makes testing of the new functionality of PyUnicode_TranslateCharmap impossible with Python scripts. The patch should add an optional argument errors to unicode.translate to expose the functionality and make testing possible.

Codecs that do something different than encoding/decoding from/to unicode and want to use the new machinery can define their own exception classes and the strict handlers will automatically work with it. The other predefined error handlers are unicode specific and expect to get a Unicode(Encode|Decode|Translate)Error exception object so they won’t work.

Backwards Compatibility

The semantics of unicode.encode with errors="replace" has changed: The old version always stored a ? character in the output string even if no character was mapped to ? in the mapping. With the proposed patch, the replacement string from the callback will again be looked up in the mapping dictionary. But as all supported encodings are ASCII based, and thus map ? to ?, this should not be a problem in practice.

Illegal values for the errors argument raised ValueError before, now they will raise LookupError.
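The LookupError behaviour can be checked as follows. This Python 3 sketch uses a made-up handler name; the input contains an unencodable character because CPython may only look the handler up once an error actually occurs.

```python
try:
    "\xe4".encode("ascii", "example.no-such-handler")
    error_type = None
except LookupError as exc:
    error_type = type(exc)

# LookupError is not a ValueError, so old "except ValueError" code
# would no longer catch this case.
caught_by_value_error = issubclass(LookupError, ValueError)
```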

References

[1] SF feature request #403100 “Multicharacter replacements in PyUnicode_TranslateCharmap” https://bugs.python.org/issue403100

[2] SF patch #432401 “unicode encoding error callbacks” https://bugs.python.org/issue432401

Copyright

This document has been placed in the public domain.

Source: https://github.com/python/peps/blob/main/peps/pep-0293.rst

Last modified: 2025-02-01 08:59:27 GMT
