[Python-Dev] Python 3.0.1 (io-in-c)

Wed Jan 28 17:54:49 CET 2009

2009/1/28 Antoine Pitrou <solipsis at pitrou.net>:
> If you look at how utf-8 decoding is implemented (in unicodeobject.c), it's
> quite obvious why it is so :-) There is a (very) fast path for chunks of pure
> ASCII data, and (fast but not blazingly fast) fallback for non ASCII data.

Thanks for the explanation.

> Please don't think of it as a slowdown... It's still much faster than 2.x, which
> manages 130MB/s on the same data.

Don't get me wrong: I'm hugely grateful for this work. And
personally, I don't expect I/O speed ever to be a real bottleneck in
the type of program I write. But I'm concerned that (much as with the
whole "Python 3.0 is incompatible, and it will be hard to port to"
meme) people will pick up on raw benchmark figures, even when they
aren't comparing like with like, and start claiming that "Python 3.0
I/O is slower than 2.x", which would be a great disservice to the good
work that's been done.

I do think it's worth taking care over the default encoding, though.
Quite apart from performance, getting "correct" behaviour is
important. I can't speak for Unix, but on Windows, the following
behaviour feels like a bug to me:

>echo a£b >a1

>python
Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print open("a1").read()
a£b

>>> ^Z

>\Apps\Python30\python.exe
Python 3.0 (r30:67507, Dec  3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print(open("a1").read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Apps\Python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
  File "D:\Apps\Python30\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0153' in position 1: character maps to <undefined>
>>> ^Z

>chcp
Active code page: 850
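
A minimal sketch of what seems to be going on in the session above, under
two assumptions that the transcript suggests but does not state outright:
that "echo" wrote the pound sign as the single cp850 byte 0x9C, and that
Python 3.0's open() decoded the file with the ANSI code page (taken here to
be cp1252) rather than the console code page. The byte string below is a
hand-written stand-in for the file contents, without the trailing bytes
that "echo" adds.

```python
# Stand-in for the contents of "a1" as written under code page 850
# (assumption: the pound sign was stored as the single byte 0x9C).
raw = b"a\x9cb"

# Decoding with the console code page recovers the pound sign.
as_console = raw.decode("cp850")

# Decoding with cp1252 (assumed here to be the default that open() picked)
# turns the same byte into U+0153, the character named in the traceback.
as_ansi = raw.decode("cp1252")

# ascii() keeps this output printable whatever the console encoding is.
print(ascii(as_console))
print(ascii(as_ansi))

# Writing the mis-decoded text back to a cp850 console fails, because
# cp850 has no mapping for U+0153.
try:
    as_ansi.encode("cp850")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

print(encode_error)
```

If those assumptions hold, passing the encoding explicitly, as in
open("a1", encoding="cp850"), should read the file back as intended.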

Paul.