[Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

Victor Stinner victor.stinner at haypocalc.com
Wed May 25 17:48:11 CEST 2011


On Wednesday, 25 May 2011 at 15:43 +0200, M.-A. Lemburg wrote:
> For UTF-16 it would e.g. make sense to always read data in blocks
> with even sizes, removing the trial-and-error decoding and extra
> buffering currently done by the base classes. For UTF-32, the
> blocks should have size % 4 == 0.
>
> For UTF-8 (and other variable length encodings) it would make
> sense looking at the end of the (bytes) data read from the
> stream to see whether a complete code point was read or not,
> rather than simply running the decoder on the complete data
> set, only to find that a few bytes at the end are missing.

I think that the readahead algorithm is much faster than trying to
avoid partial input, and partial input is not a problem if you use an
incremental decoder.
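
An incremental decoder accepts a chunk that ends in the middle of a
multi-byte sequence without raising an error: it just keeps the
trailing bytes until the next call. A minimal illustration (not taken
from the benchmark script):

import codecs

# Feed UTF-8 to an incremental decoder in arbitrary chunks: a
# multi-byte sequence split across two chunks is simply buffered
# until the remaining bytes arrive (final=False).
decoder = codecs.getincrementaldecoder('utf-8')()
data = 'héllo'.encode('utf-8')        # b'h\xc3\xa9llo'
part1, part2 = data[:2], data[2:]     # split inside the 2-byte 'é'

print(decoder.decode(part1, final=False))  # 'h' (the lone \xc3 is kept back)
print(decoder.decode(part2, final=True))   # 'éllo'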

> For single character encodings, it would make sense to prefetch
> data in big chunks and skip all the trial and error decoding
> implemented by the base classes to address the above problem
> with variable length encodings.

TextIOWrapper implements this optimization using its readahead
algorithm.
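
A quick way to see the readahead in action is to wrap the binary
buffer in a class that counts how often TextIOWrapper pulls data from
it (a toy example, not part of bench.py):

import io

class CountingBuffer(io.BytesIO):
    """Binary buffer that counts how often TextIOWrapper pulls data."""
    def __init__(self, data):
        super().__init__(data)
        self.raw_reads = 0
    def read(self, size=-1):
        self.raw_reads += 1
        return super().read(size)
    def read1(self, size=-1):
        self.raw_reads += 1
        return super().read1(size)

buffered = CountingBuffer(b'x' * 100000)
text = io.TextIOWrapper(buffered, encoding='ascii')
while text.read(1):           # 100,000 one-character reads at the text level
    pass
print(buffered.raw_reads)     # only a handful of big reads on the binary side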

> That's somewhat unfair: TextIOWrapper is implemented in C,
> whereas the StreamReader/Writer subclasses used by the
> codecs are written in Python.
> 
> A fair comparison would use the Python implementation of
> TextIOWrapper.

Do you mean that you would like to reimplement codecs in C? It is not
relevant to compare codecs and _pyio, because codecs reuses the
BufferedReader of the io module (not of the _pyio module), and io is
the main I/O module of Python 3.
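
You can see this directly; for example (assuming a readable README in
the current directory):

import codecs

f = codecs.open('README', encoding='ascii')
print(type(f))          # codecs.StreamReaderWriter
print(type(f.stream))   # _io.BufferedReader (the io module, not _pyio)
f.close()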

But fine, as you wish, here is a benchmark comparing:

    _pyio.TextIOWrapper(io.open(filename, 'rb'), encoding)
and
    codecs.open(filename, encoding)

The only change from my previous bench.py script is the test_io()
function:

def test_io(test_func, chunk_size):
    with open(FILENAME, 'rb') as buffered:
        f = _pyio.TextIOWrapper(buffered, ENCODING)
        test_file(f, test_func, chunk_size)
        f.close()
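
(The codecs side is unchanged; for reference, it is essentially the
mirror image, something like:)

def test_codecs(test_func, chunk_size):
    # Same driver, but going through codecs.open(), which returns a
    # codecs.StreamReader-based file object.
    f = codecs.open(FILENAME, encoding=ENCODING)
    try:
        test_file(f, test_func, chunk_size)
    finally:
        f.close()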


(1) Decode Objects/unicodeobject.c (317336 characters) from utf-8

test_io.readline(): 1193.4 ms
test_codecs.readline(): 1267.9 ms
-> codecs 6% slower than io

test_io.read(1): 21696.4 ms
test_codecs.read(1): 36027.2 ms
-> codecs 66% slower than io

test_io.read(100): 3080.7 ms
test_codecs.read(100): 3901.7 ms
-> codecs 27% slower than io

test_io.read(): 3991.0 ms
test_codecs.read(): 1736.9 ms
-> codecs 130% FASTER than io


(2) Decode README (6613 characters) from ascii

test_io.readline(): 678.1 ms
test_codecs.readline(): 760.5 ms
-> codecs 12% slower than io

test_io.read(1): 13533.2 ms
test_codecs.read(1): 21900.0 ms
-> codecs 62% slower than io

test_io.read(100): 2663.1 ms
test_codecs.read(100): 3270.1 ms
-> codecs 23% slower than io

test_io.read(): 6769.1 ms
test_codecs.read(): 3919.6 ms
-> codecs 73% FASTER than io


(3) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from
gb18030

test_io.readline(): 38.9 ms
test_codecs.readline(): 15.1 ms
-> codecs 157% FASTER than io

test_io.read(1): 369.8 ms
test_codecs.read(1): 302.2 ms
-> codecs 22% FASTER than io

test_io.read(100): 258.2 ms
test_codecs.read(100): 155.1 ms
-> codecs 67% FASTER than io

test_io.read(): 1803.2 ms
test_codecs.read(): 1002.9 ms
-> codecs 80% FASTER than io


_pyio.TextIOWrapper is faster than codecs.StreamReader for readline(),
read(1) and read(100) with the ASCII and UTF-8 files, but slower with
the gb18030 file.

As in the io vs codecs benchmark, codecs.StreamReader is always faster
than _pyio.TextIOWrapper for read().

Victor


