python3 urlopen(...).read() returns bytes

Mon Dec 22 22:19:32 EST 2008

Glenn G. Chappell wrote:
> Okay, so I guess I didn't really *get* the whole unicode/text/binary
> thing. Maybe I still don't, but I think I'm getting closer. Thanks to
> everyone who replied.

The basic principle is easy. On the one hand you have text as
unicode data; on the other hand you have binary data that may
contain text in an arbitrary encoding. To get the text you have
to decode the byte data into unicode. The other way around is called
encoding.
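
To make the two directions concrete, here is a minimal sketch in Python 3.
The sample word and the two encodings are just assumptions picked for
illustration; any text with non-ASCII characters shows the same thing.

```python
# A str holds unicode text; bytes holds raw binary data.
text = "Grüße"

# Encoding: text -> bytes. The same text gives different bytes
# depending on the encoding chosen.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Decoding: bytes -> text. You get the original text back only if
# you decode with the encoding that was used to produce the bytes.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_latin1 = latin1_bytes.decode("latin-1")

# Decoding with the wrong encoding may not raise an error at all;
# it can silently produce garbage instead.
wrong = utf8_bytes.decode("latin-1")
```

The last line is the dangerous case: no exception, just wrong text.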

Everybody in the whole world has to deal with unicode *unless* you
live in the USA and all you have is plain and simple ASCII text. Python
2.x makes no distinction between ASCII text and arbitrary bytes. Both
are stored in the str type. This makes life easy for ASCII countries, but
the rest of the world suffers the consequences.

Python 3.0 is a hard break for ASCII people because with 3.0 everybody
really has to deal with encodings. There is no more implicit
conversion between byte strings and unicode text.
http://www.joelonsoftware.com/articles/Unicode.html explains it in great
detail.
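
A small sketch of what "no implicit conversion" means in practice in
Python 3. The byte string here is a made-up example (UTF-8 bytes for
"café"), and the variable names are only for illustration.

```python
data = b"caf\xc3\xa9"  # bytes, e.g. as returned by urlopen(...).read()

# Mixing str and bytes is an error in Python 3; nothing is guessed.
try:
    combined = "menu: " + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# The fix is an explicit decode with a known encoding.
combined = "menu: " + data.decode("utf-8")

# Comparisons never convert either: bytes and str are simply unequal.
same = (b"abc" == "abc")
```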

> 
> On Dec 22, 1:41 pm, ajaksu <aja... at gmail.com> wrote:
>> On Dec 22, 8:25 pm, Christian Heimes <li... at cheimes.de> wrote:
>> That said, a "decode to declared HTTP header encoding" version of
>> urlopen could be useful to give some users the output they want (text
>> from network io) or to make it clear why bytes is the safe way.
> 
> Sounds like a great idea. More to the point, it sounds like it's
> pretty much a necessary idea.
> 
> Consider: reading a web page is an easy one-liner. Now, no one is
> going to write that one-liner, and then spend 20 lines trying to get
> the Content-Type and encoding figured out. Instead we're all going to
> do it the short, easy, *wrong* way. So every program in the world that
> uses urlopen gets to have the same bug. Not good. The *right* way
> needs to be the *easy* way.

Python 2.x suffers from the same problem. It just doesn't tell you from
the beginning that you need to deal with it. With 2.x you can
read websites fine - until you have to deal with non-English, non-ASCII
text. 3.0 forces the developer to think about the issue right from
the beginning. No more excuses :)
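
For what it's worth, a "decode to declared HTTP header encoding" helper
does not need 20 lines. The following is only a rough sketch, not an
existing urllib API: the function names charset_from_content_type and
read_text are hypothetical, and falling back to UTF-8 when the server
declares no charset is an assumption. It also ignores charsets declared
inside the document itself (e.g. an HTML meta tag), and servers sometimes
declare the wrong charset.

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Hypothetical helper; 'default' is an assumed fallback.
    """
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset is declared.
    return msg.get_content_charset() or default


def read_text(url, default="utf-8"):
    """Fetch a URL and decode the body using the declared charset.

    Hypothetical helper, not part of urllib.
    """
    with urlopen(url) as response:
        raw = response.read()  # always bytes in Python 3
        content_type = response.headers.get("Content-Type", "")
    return raw.decode(charset_from_content_type(content_type, default))
```

With something like this, read_text(url) gives you str instead of bytes,
and the decision about the encoding is made in one visible place.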

I suggest somebody make a feature request for 3.1. A patch with unit
tests increases the chances of acceptance by at least an order of
magnitude.

Christian