python3 urlopen(...).read() returns bytes

Mon Dec 22 17:25:10 EST 2008

Glenn G. Chappell wrote:
> I just ran 2to3 on a py2.5 script that does pattern matching on the
> text of a web page. The resulting script crashed, because when I did
> 
>     f = urllib.request.urlopen(url)
>     text = f.read()
> 
> then "text" is a bytes object, not a string, and so I can't do a
> regexp on it.
> 
> Of course, this is easy to patch: just do "f.read().decode()".
> However, it strikes me as an obvious bug, which ought to be fixed.
> That is, read() should return a string, as it did in py2.5.

That isn't possible unless you know the encoding of the bytes. Network
I/O only returns bytes, and you must decode them explicitly. Your "patch"
breaks as soon as a remote site returns the data in an encoding other
than the one decode() assumes by default. It also breaks if the site
returns image/*, application/*, audio/* or any MIME type other than
text.
There is no generic and simple way to detect the encoding of a remote
site. Sometimes the encoding is mentioned in the HTTP header, sometimes
it's embedded in the <head> section of the HTML document.
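
For the case where the encoding is given in the HTTP header, here is a
minimal sketch of what explicit decoding could look like. The function
name decode_body and the UTF-8 fallback are my own assumptions for
illustration, not part of urllib. It does not look inside the HTML
<head>, so it only covers the first of the two cases above.

```python
from email.message import Message


def decode_body(body, content_type, fallback="utf-8"):
    """Decode a response body using the charset from a Content-Type value.

    Hypothetical helper: returns None for non-text MIME types, and uses
    `fallback` (an assumption, not a standard) when no charset is given.
    """
    # Let the stdlib parse the header value instead of splitting by hand.
    msg = Message()
    msg["Content-Type"] = content_type

    # image/*, application/*, audio/* and so on stay as bytes.
    if msg.get_content_maintype() != "text":
        return None

    charset = msg.get_content_charset() or fallback
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # The server named a codec Python does not know.
        return body.decode(fallback, errors="replace")
```

With a real response object f from urllib.request.urlopen(url), you
would probably call it as decode_body(f.read(),
f.headers.get("Content-Type", "")) and fall back to working on the raw
bytes when it returns None.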

> This change breaks pretty much every Python program that opens a
> webpage, doesn't it? 2to3 doesn't catch it, and, in any case, why
> should read() return bytes, not string? Am I missing something?

I hope that explains the issue. By the way, Python 2.x and 3.0 both
return bytes from read() (a str in 2.x, a bytes object in 3.0).

Christian