PEP 400 – Deprecate codecs.StreamReader and codecs.StreamWriter

Author: Victor Stinner <vstinner at python.org>
Status: Deferred

Abstract

io.TextIOWrapper and codecs.StreamReaderWriter offer the same API [1]. TextIOWrapper has more features and is faster than StreamReaderWriter. Duplicate code means that bugs must be fixed twice and that there may be subtle differences between the two implementations.

The codecs module was introduced in Python 2.0 (see PEP 100). The io module was introduced in Python 2.6 and 3.0 (see PEP 3116), and reimplemented in C in Python 2.7 and 3.1.

PEP Deferral

Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.

Motivation

When the Python I/O model was updated for 3.0, the concept of a “stream-with-known-encoding” was introduced in the form of io.TextIOWrapper. As this class is critical to the performance of text-based I/O in Python 3, this module has an optimised C version which is used by CPython by default. Many corner cases in handling buffering, stateful codecs and universal newlines have been dealt with since the release of Python 3.0.

This new interface overlaps heavily with the legacy codecs.StreamReader, codecs.StreamWriter and codecs.StreamReaderWriter interfaces that were part of the original codec interface design in PEP 100. These interfaces are organised around the principle of an encoding with an associated stream (i.e. the reverse of the arrangement in the io module), so the original PEP 100 design required that codec writers provide appropriate StreamReader and StreamWriter implementations in addition to the core codec encode() and decode() methods. This places a heavy burden on codec authors, who must provide these specialised implementations and correctly handle many of the corner cases (see Appendix A) that have now been dealt with by io.TextIOWrapper. While deeper integration between the codec and the stream allows for additional optimisations in theory, in practice these optimisations have either not been carried out, or else the associated code duplication means that the corner cases that have been fixed in io.TextIOWrapper are still not handled correctly in the various StreamReader and StreamWriter implementations.

Accordingly, this PEP proposes that:

codecs.open() be updated to delegate to the builtin open() in Python 3.3;
the legacy codecs.Stream* interfaces, including the streamreader and streamwriter attributes of codecs.CodecInfo, be deprecated in Python 3.3.

Rationale

StreamReader and StreamWriter issues

StreamReader is unable to translate newlines.
StreamWriter doesn’t support “line buffering” (flush if the input text contains a newline).
StreamReader classes of the CJK encodings (e.g. GB18030) only support UNIX newlines ('\n').
StreamReader and StreamWriter are stateful codecs but don’t expose functions to control their state (getstate() or setstate()). Each codec has to handle corner cases; see Appendix A.
StreamReader and StreamWriter are very similar to IncrementalDecoder and IncrementalEncoder, so some code is duplicated for stateful codecs (e.g. UTF-16).
Each codec has to reimplement its own StreamReader and StreamWriter class, even if it’s trivial (just call the encoder/decoder).
codecs.open(filename, "r") creates an io.TextIOWrapper object.
No codec implements an optimized method in StreamReader or StreamWriter based on the specificities of the codec.
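
The first point in the list can be illustrated with a small sketch. It uses in-memory byte streams instead of real files, and the variable names are chosen here for illustration only:

```python
import codecs
import io

data = "first\r\nsecond\r\n".encode("utf-8")

# Legacy interface: a StreamReader wraps a byte stream and decodes it,
# but leaves the line endings exactly as they appear in the input.
legacy_reader = codecs.getreader("utf-8")(io.BytesIO(data))
legacy_text = legacy_reader.read()

# io interface: with the default newline=None, TextIOWrapper translates
# "\r\n" (and "\r") to "\n" while reading.
modern_reader = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
modern_text = modern_reader.read()

print(repr(legacy_text))
print(repr(modern_text))
```

Passing newline="" to TextIOWrapper disables the translation, which is one way to reproduce the StreamReader behaviour when it is actually wanted.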

Issues in the bug tracker:

Issue #5445 (2009-03-08): codecs.StreamWriter.writelines problem when passed generator
Issue #7262 (2009-11-04): codecs.open() + eol (windows)
Issue #8260 (2010-03-29): When I use codecs.open(…) and f.readline() follow up by f.read() return bad result
Issue #8630 (2010-05-05): Keepends param in codec readline(s)
Issue #10344 (2010-11-06): codecs.readline doesn’t care buffering
Issue #11461 (2011-03-10): Reading UTF-16 with codecs.readline() breaks on surrogate pairs
Issue #12446 (2011-06-30): StreamReader Readlines behavior odd
Issue #12508 (2011-07-06): Codecs Anomaly
Issue #12512 (2011-07-07): codecs: StreamWriter issues with stateful codecs after a seek or with append mode
Issue #12513 (2011-07-07): codec.StreamReaderWriter: issues with interlaced read-write

TextIOWrapper features

TextIOWrapper supports any kind of newline, including translating newlines (to UNIX newlines), for both reading and writing.
TextIOWrapper reuses the codecs incremental encoders and decoders (no duplication of code).
The io module (TextIOWrapper) is faster than the codecs module (StreamReader). It is implemented in C, whereas codecs is implemented in Python.
TextIOWrapper has a readahead algorithm which speeds up small reads: reading character by character or line by line (io is 10 to 25 times faster than codecs on these operations).
TextIOWrapper has a write buffer.
TextIOWrapper.tell() is optimized.
TextIOWrapper supports random access (read+write) using a single class, which makes it possible to optimize interlaced read-write (but no such optimization is implemented).
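
As a rough illustration of the newline and write-buffer features listed above, the following sketch wraps an in-memory byte stream. The stream and variable names are assumptions made for the example:

```python
import io

raw = io.BytesIO()

# line_buffering=True asks TextIOWrapper to flush whenever the written
# text contains a newline; newline="\r\n" translates "\n" on output.
writer = io.TextIOWrapper(raw, encoding="utf-8", newline="\r\n",
                          line_buffering=True)

writer.write("abc")      # no newline yet: may stay in the write buffer
writer.write("def\n")    # contains a newline: flushed to the raw stream
flushed = raw.getvalue()
print(flushed)
```

The same constructor arguments are available through the builtin open(), where line buffering is requested with buffering=1.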

TextIOWrapper issues

Issue #12215 (2011-05-30): TextIOWrapper: issues with interlaced read-write

Possible improvements of StreamReader and StreamWriter

By adding codec state read/write functions to the StreamReader and StreamWriter classes, it would become possible to fix issues with stateful codecs in a base class instead of in each stateful StreamReader and StreamWriter class.

It would be possible to change StreamReader and StreamWriter to make them use IncrementalDecoder and IncrementalEncoder.
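
A minimal sketch of what such a change could look like is given below. IncrementalStreamReader is a hypothetical class invented for this example, not part of the codecs module, and it ignores line handling and seeking entirely:

```python
import codecs
import io


class IncrementalStreamReader:
    """Hypothetical reader that delegates decoding to an IncrementalDecoder."""

    def __init__(self, stream, encoding, errors="strict"):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)(errors)

    def read(self, size=-1):
        # size counts bytes read from the underlying stream, not characters.
        data = self.stream.read(size)
        # final=True once the stream is exhausted, so that a truncated
        # sequence is reported instead of being silently buffered.
        return self.decoder.decode(data, final=not data)

    # State accessors implemented once, for every codec.
    def getstate(self):
        return self.decoder.getstate()

    def setstate(self, state):
        self.decoder.setstate(state)


# "é" (U+00E9) is two bytes in UTF-8; read the stream one byte at a time.
reader = IncrementalStreamReader(io.BytesIO("\u00e9!".encode("utf-8")), "utf-8")
chunks = [reader.read(1) for _ in range(3)]
print(chunks)
```

The first read returns an empty string because the decoder keeps the incomplete byte in its own buffer, which is the kind of corner case each StreamReader currently has to handle separately.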

A codec can implement variants which are optimized for the specific encoding, or intercept certain stream methods to add functionality or improve the encoding/decoding performance. TextIOWrapper cannot implement such optimizations, but it uses incremental encoders and decoders together with read and write buffers, so the overhead of incomplete inputs is low or nil.

A lot more could be done for other variable length encoding codecs, e.g. UTF-8, since these often have problems near the end of a read due to missing bytes. The UTF-32-BE/LE codecs could simply multiply the character position by 4 to get the byte position.
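
The UTF-32 remark amounts to simple arithmetic, sketched here with a hypothetical byte_offset() helper:

```python
text = "caf\u00e9 \u2615 time"
encoded = text.encode("utf-32-le")  # fixed width: 4 bytes per code point, no BOM


def byte_offset(char_index):
    # For UTF-32-LE/BE the mapping is a plain multiplication.
    return char_index * 4


# Jump straight to the 8th character without decoding what precedes it.
tail = encoded[byte_offset(7):].decode("utf-32-le")
print(tail)
```

No such shortcut exists for UTF-8 or UTF-16, where the byte position depends on the characters that come before it.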

Usage of StreamReader and StreamWriter

These classes are rarely used directly, but are used indirectly through codecs.open(). They are not used in the Python 3 standard library (except in the codecs module).

Some projects implement their own codec with StreamReader and StreamWriter, but don’t use these classes.

Backwards Compatibility

Keep the public API, codecs.open

codecs.open() can be replaced by the builtin open() function. open() has a similar API but also has more options. Both functions return file-like objects (same API).

codecs.open() was the only way to open a text file in Unicode mode until Python 2.6. Many Python 2 programs use this function. Removing codecs.open() implies more work to port programs from Python 2 to Python 3, especially for projects using the same code base for the two Python versions (without using the 2to3 program).

codecs.open() is kept for backward compatibility with Python 2.

Deprecate StreamReader and StreamWriter

Instantiating StreamReader or StreamWriter must emit a DeprecationWarning in Python 3.3. Defining a subclass doesn’t emit a DeprecationWarning.
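
One way to get this behaviour is to emit the warning from __init__, which runs on instantiation but not when a subclass is defined. The sketch below uses a hypothetical LegacyStreamReader stand-in rather than the real codecs.StreamReader, and the warning message is invented for the example:

```python
import warnings


class LegacyStreamReader:
    """Hypothetical stand-in for codecs.StreamReader."""

    def __init__(self, stream, errors="strict"):
        # Warn at instantiation time; stacklevel=2 attributes the warning
        # to the caller that created the object.
        warnings.warn(
            "StreamReader is deprecated, use io.TextIOWrapper",
            DeprecationWarning,
            stacklevel=2,
        )
        self.stream = stream
        self.errors = errors


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    # Defining a subclass runs no __init__, so nothing is emitted here.
    class MyReader(LegacyStreamReader):
        pass

    warnings_after_subclassing = len(caught)

    MyReader(stream=None)  # instantiation triggers the warning
    warnings_after_instantiation = len(caught)

print(warnings_after_subclassing, warnings_after_instantiation)
```

Codec authors who only define subclasses would therefore see no warning until something actually creates a reader or writer.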

codecs.open() will be changed to reuse the builtin open() function (TextIOWrapper) to read-write text files.
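
The PEP does not spell out the new implementation, so the following is only a sketch of how the delegation might look. open_compat() is a hypothetical name, and the choices to strip the "b" flag and to pass newline="" (so that line endings are left untouched, as with the legacy function) are assumptions of this example:

```python
import io
import os
import tempfile


def open_compat(filename, mode="r", encoding=None, errors="strict", buffering=-1):
    """Hypothetical codecs.open() replacement built on the builtin open()."""
    if encoding is None:
        # No encoding requested: behave like a plain open() call.
        return open(filename, mode, buffering)
    # Assumption: drop any "b" flag, since the file is opened in text mode,
    # and pass newline="" so that line endings are not translated.
    text_mode = mode.replace("b", "")
    return open(filename, text_mode, buffering, encoding=encoding,
                errors=errors, newline="")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open_compat(path, "w", encoding="utf-16") as f:
        is_text_wrapper = isinstance(f, io.TextIOWrapper)
        f.write("abc\r\n")
    with open_compat(path, "r", encoding="utf-16") as f:
        content = f.read()

print(is_text_wrapper, repr(content))
```

Callers keep the familiar codecs.open() signature while all decoding, buffering and BOM handling is done by TextIOWrapper.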

Alternative Approach

An alternative to the deprecation of the codecs.Stream* classes is to rename codecs.open() to codecs.open_stream(), and to create a new codecs.open() function reusing open() and so io.TextIOWrapper.

Appendix A: Issues with stateful codecs

It is difficult to use a stateful codec correctly with a stream. Some cases are supported by the codecs module, while io has no remaining known bugs related to stateful codecs. The main difference between the codecs and the io module is that bugs have to be fixed in the StreamReader and/or StreamWriter classes of each codec for the codecs module, whereas bugs can be fixed only once in io.TextIOWrapper. Here are some examples of issues with stateful codecs.

Stateful codecs

Python supports the following stateful codecs:

cp932
cp949
cp950
euc_jis_2004
euc_jisx0213
euc_jp
euc_kr
gb18030
gbk
hz
iso2022_jp
iso2022_jp_1
iso2022_jp_2
iso2022_jp_2004
iso2022_jp_3
iso2022_jp_ext
iso2022_kr
shift_jis
shift_jis_2004
shift_jisx0213
utf_8_sig
utf_16
utf_32
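
To make “stateful” concrete, the sketch below uses the incremental UTF-16 encoder, whose output depends on whether the BOM has already been written. The variable names are chosen for illustration:

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

first = encoder.encode("a")    # first call: BOM followed by "a"
second = encoder.encode("b")   # later calls: no BOM, the state has changed

# getstate()/setstate() let a stream wrapper save and restore that state.
state_after_bom = encoder.getstate()
encoder.reset()                # back to the initial state: BOM pending again
third = encoder.encode("c")
encoder.setstate(state_after_bom)
fourth = encoder.encode("d")   # restored state: no BOM

print(len(first), len(second), len(third), len(fourth))
```

TextIOWrapper relies on these state functions when seeking or appending. StreamWriter exposes no equivalent, which is the root of the issues described in the rest of this appendix.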

Read and seek(0)

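    # filename is assumed to name a writable scratch file;
    # 'w+' is required so that the file can be read back after writing.
    with open(filename, 'w+', encoding='utf-16') as f:
        f.write('abc')
        f.write('def')
        f.seek(0)
        assert f.read() == 'abcdef'
        f.seek(0)
        assert f.read() == 'abcdef'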

The io and codecs modules support this use case correctly.

seek(n)

    with open(filename, 'w', encoding='utf-16') as f:
        f.write('abc')
        pos = f.tell()
    with open(filename, 'w', encoding='utf-16') as f:
        f.seek(pos)
        f.write('def')
        f.seek(0)
        f.write('###')
    with open(filename, 'r', encoding='utf-16') as f:
        assert f.read() == '###def'

The io module supports this use case, whereas codecs fails because it writes a new BOM on the second write (issue #12512).
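
The extra BOM can be observed directly at the StreamWriter level. The sketch below is an approximation of the scenario that uses an in-memory stream instead of a file, and it reflects the behaviour of CPython’s utf-16 StreamWriter as understood here, which may differ between versions:

```python
import codecs
import io

buf = io.BytesIO()

# First writer: emits the BOM, then "abc".
codecs.getwriter("utf-16")(buf).write("abc")
pos = buf.tell()

# Second writer on the same stream, positioned after the existing data.
# A fresh StreamWriter has no way to learn that a BOM is already present.
buf.seek(pos)
codecs.getwriter("utf-16")(buf).write("def")

decoded = buf.getvalue().decode("utf-16")
print(repr(decoded))
```

The second BOM is decoded as a U+FEFF character in the middle of the text, so the content read back no longer matches what was written.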

Append mode

    with open(filename, 'w', encoding='utf-16') as f:
        f.write('abc')
    with open(filename, 'a', encoding='utf-16') as f:
        f.write('def')
    with open(filename, 'r', encoding='utf-16') as f:
        assert f.read() == 'abcdef'