[Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

M.-A. Lemburg mal at egenix.com
Fri May 27 10:17:29 CEST 2011


Victor Stinner wrote:
> On Wednesday, 25 May 2011 at 15:43 +0200, M.-A. Lemburg wrote:
>> For UTF-16 it would e.g. make sense to always read data in blocks
>> with even sizes, removing the trial-and-error decoding and extra
>> buffering currently done by the base classes. For UTF-32, the
>> blocks should have size % 4 == 0.
>>
>> For UTF-8 (and other variable length encodings) it would make
>> sense looking at the end of the (bytes) data read from the
>> stream to see whether a complete code point was read or not,
>> rather than simply running the decoder on the complete data
>> set, only to find that a few bytes at the end are missing.
> 
> I think that the readahead algorithm is much faster than trying to
> avoid partial input, and it's not a problem to have partial input if
> you use an incremental decoder.

Depends on where you're coming from. For non-seekable streams
such as sockets or pipes, readahead is not going to work.

For seekable streams, I agree that readahead is the better strategy.

And of course, it also makes sense to use incremental decoders
for these encodings.
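
To illustrate the idea with a minimal sketch (not meant as a
drop-in replacement for the codec readers): an incremental decoder
keeps any trailing partial byte sequence as internal state, so
partial reads from non-seekable streams are handled without
trial-and-error decoding:

    import codecs

    def iter_decoded(stream, encoding, chunk_size=4096):
        # Never seeks, so this also works for pipes and socket files;
        # incomplete byte sequences stay inside the decoder as state.
        decoder = codecs.getincrementaldecoder(encoding)()
        while True:
            data = stream.read(chunk_size)
            if not data:
                break
            yield decoder.decode(data)         # may be '' on partial input
        yield decoder.decode(b"", final=True)  # flush; raises if truncated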

>> For single character encodings, it would make sense to prefetch
>> data in big chunks and skip all the trial and error decoding
>> implemented by the base classes to address the above problem
>> with variable length encodings.
> 
> TextIOWrapper implements this optimization using its readahead
> algorithm.

It does, yes, but the above is an optimization specific
to single-character encodings, not all encodings, and
TextIOWrapper doesn't know anything about the specific
characteristics of the underlying encodings (except perhaps
for a few special cases).
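
As a sketch of what such a per-codec strategy could look like
(hypothetical helper; UTF-32 is the easy case, since every code
point is exactly four bytes, while UTF-16 would additionally have
to watch for surrogate pairs at block boundaries):

    import codecs

    def read_fixed_width(stream, encoding="utf-32-le", width=4,
                         chunk_size=4096):
        # Carry at most width-1 trailing bytes over to the next round,
        # so every decode call only ever sees complete code units.
        decode = codecs.getdecoder(encoding)
        pending = b""
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                if pending:
                    raise ValueError("truncated %s data" % encoding)
                break
            data = pending + chunk
            cut = len(data) - len(data) % width
            pending = data[cut:]
            text, _ = decode(data[:cut])
            if text:
                yield text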

>> That's somewhat unfair: TextIOWrapper is implemented in C,
>> whereas the StreamReader/Writer subclasses used by the
>> codecs are written in Python.
>>
>> A fair comparison would use the Python implementation of
>> TextIOWrapper.
> 
> Do you mean that you would like to reimplement codecs in C? 

As use of Unicode codecs increases in Python applications,
this would certainly be an approach to consider, yes.

Looking at the current situation, it is better to use
TextIOWrapper, as it provides better performance. But since
TextIOWrapper cannot (by design) provide per-codec optimizations,
this is likely to change once the codecs that benefit most from
such optimizations are rewritten in C.

> It is not
> relevant to compare codecs and _pyio, because codecs reuses
> BufferedReader (of the io module, not of the _pyio module), and io is
> the main I/O module of Python 3.

They both use whatever stream you pass in as a parameter,
so your TextIOWrapper benchmark will also use the BufferedReader
of the io module.

The point here is to compare Python to Python, not Python
to C.

> But well, as you want, here is a benchmark comparing:
>    _pyio.TextIOWrapper(io.open(filename, 'rb'), encoding)
> and 
>     codecs.open(filename, encoding)
> 
> The only change compared to my previous bench.py script is the
> test_io() function:
> 
> def test_io(test_func, chunk_size):
>     with open(FILENAME, 'rb') as buffered:
>         f = _pyio.TextIOWrapper(buffered, ENCODING)
>         test_file(f, test_func, chunk_size)
>         f.close()

Thanks for running those tests.
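
For readers who don't have the earlier bench.py at hand, the
full-file read() comparison boils down to something like this
(FILENAME and ENCODING are placeholders for the values used in
the benchmark):

    import codecs, io, timeit, _pyio

    FILENAME = "Objects/unicodeobject.c"  # placeholder test file
    ENCODING = "utf-8"                    # placeholder encoding

    def read_with_pyio():
        with io.open(FILENAME, "rb") as buffered:
            f = _pyio.TextIOWrapper(buffered, ENCODING)
            f.read()
            f.close()

    def read_with_codecs():
        with codecs.open(FILENAME, encoding=ENCODING) as f:
            f.read()

    for func in (read_with_pyio, read_with_codecs):
        print(func.__name__, timeit.timeit(func, number=10))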

> (1) Decode Objects/unicodeobject.c (317336 characters) from utf-8
> 
> test_io.readline(): 1193.4 ms
> test_codecs.readline(): 1267.9 ms
> -> codecs 6% slower than io
> 
> test_io.read(1): 21696.4 ms
> test_codecs.read(1): 36027.2 ms
> -> codecs 66% slower than io
> 
> test_io.read(100): 3080.7 ms
> test_codecs.read(100): 3901.7 ms
> -> codecs 27% slower than io

This shows that StreamReader/Writer could benefit quite
a bit from using incremental encoders/decoders.
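
A sketch of the direction (hypothetical reader, not the current
StreamReader code): serve small read() calls from a character
buffer and let an incremental decoder absorb partial byte
sequences, instead of re-trying the decode:

    import codecs

    class IncrementalReader:
        def __init__(self, stream, encoding, chunk_size=8192):
            self.stream = stream
            self.decoder = codecs.getincrementaldecoder(encoding)()
            self.chunk_size = chunk_size
            self.charbuf = ""

        def read(self, size=-1):
            # Refill the character buffer in large chunks; read(1)
            # then only slices a string instead of hitting the codec.
            while size < 0 or len(self.charbuf) < size:
                data = self.stream.read(self.chunk_size)
                self.charbuf += self.decoder.decode(data, final=not data)
                if not data:
                    break
            if size < 0:
                size = len(self.charbuf)
            result, self.charbuf = self.charbuf[:size], self.charbuf[size:]
            return result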

> test_io.read(): 3991.0 ms
> test_codecs.read(): 1736.9 ms
> -> codecs 130% FASTER than io

No surprise here. Reading the whole file in one go is also a
very common use case, and the bigger the file, the more impact
this has.

> (2) Decode README (6613 characters) from ascii
> 
> test_io.readline(): 678.1 ms
> test_codecs.readline(): 760.5 ms
> -> codecs 12% slower than io
> 
> test_io.read(1): 13533.2 ms
> test_codecs.read(1): 21900.0 ms
> -> codecs 62% slower than io
> 
> test_io.read(100): 2663.1 ms
> test_codecs.read(100): 3270.1 ms
> -> codecs 23% slower than io
> 
> test_io.read(): 6769.1 ms
> test_codecs.read(): 3919.6 ms
> -> codecs 73% FASTER than io

See above.

> (3) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from
> gb18030
> 
> test_io.readline(): 38.9 ms
> test_codecs.readline(): 15.1 ms
> -> codecs 157% FASTER than io
> 
> test_io.read(1): 369.8 ms
> test_codecs.read(1): 302.2 ms
> -> codecs 22% FASTER than io
> 
> test_io.read(100): 258.2 ms
> test_codecs.read(100): 155.1 ms
> -> codecs 67% FASTER than io
> 
> test_io.read(): 1803.2 ms
> test_codecs.read(): 1002.9 ms
> -> codecs 80% FASTER than io

These results are interesting, since gb18030 is a shift
encoding which keeps state in the encoded data stream, so the
strategy chosen by TextIOWrapper doesn't work out that
well.
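
You can see that state-keeping at work with the incremental
decoder (a small illustration):

    import codecs

    dec = codecs.getincrementaldecoder("gb18030")()
    # b'\xc4\xe3' encodes a single character; split it in two reads:
    print(repr(dec.decode(b"\xc4")))  # '' - lead byte kept as state
    print(repr(dec.decode(b"\xe3")))  # the completed character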

This hints at what I mentioned above: per-codec optimizations
are going to be relevant once these codecs see a lot of use.

> _pyio.TextIOWrapper is faster than codecs.StreamReader for readline(),
> read(1) and read(100), with ASCII and UTF-8. It is slower for gb18030.
> 
> As in the io vs codecs benchmark, codecs.StreamReader is always faster
> than _pyio.TextIOWrapper for read().

Just to repeat here what I already mentioned on the ticket:

I am still -1 on deprecating the StreamReader/Writer parts of
the codec APIs. I've given numerous reasons why these are
useful, what their intention is, and why they were added to
Python 1.6.

Since such a deprecation would change an important documented API,
please write a PEP outlining your reasoning, including my comments,
use cases and possibilities for optimizations.

Please back out your checkin:

"""
http://hg.python.org/cpython/rev/3555cf6f9c98
changeset:   70430:3555cf6f9c98
user:        Victor Stinner <victor.stinner at haypocalc.com>
date:        Fri May 27 01:51:18 2011 +0200
summary:
  Issue #8796: codecs.open() calls the builtin open() function instead of using
StreamReaderWriter. Deprecate StreamReader, StreamWriter, StreamReaderWriter,
StreamRecoder and EncodedFile() of the codec module. Use the builtin open()
function or io.TextIOWrapper instead.

files:
  Doc/library/codecs.rst  |   25 ++++
  Lib/codecs.py           |   25 ++--
  Lib/test/test_codecs.py |  152 +++++++++++++++++++--------
  Misc/NEWS               |    5 +
  4 files changed, 148 insertions(+), 59 deletions(-)
"""

I wasn't very happy to see that checkin on the checkins list...

We can discuss changing codecs.open() to use TextIOWrapper, but
your quest for deprecating APIs in Python has gone too far on
this one.

-- 
Marc-Andre Lemburg
eGenix.com

