[Python-Dev] teaching the new urllib

Wed Feb 4 01:00:38 CET 2009

On Tue, Feb 3, 2009 at 15:50, Tres Seaver <tseaver at palladion.com> wrote:
>
> Brett Cannon wrote:
>> On Tue, Feb 3, 2009 at 11:08, Brad Miller <millbr02 at luther.edu> wrote:
>>> I'm just getting ready to start the semester using my new book (Python
>>> Programming in Context) and noticed that I somehow missed all the changes to
>>> urllib in python 3.0.  ARGH to say the least.  I like using urllib in the
>>> intro class because we can get data from places that are more
>>> interesting/motivating/relevant to the students.
>>> Here are some of my observations on trying to do very basic stuff with
>>> urllib:
>>> 1.  urllib.urlopen  is now urllib.request.urlopen
>>
>> Technically urllib2.urlopen became urllib.request.urlopen. See PEP
>> 3108 for the details of the reorganization.
>>
>>> 2.  The object returned by urlopen is no longer iterable!  no more for line
>>> in url.
>>
>> That is probably a difference between urllib2 and urllib.
>>
>>> 3.  read, readline, readlines now return bytes objects or arrays of bytes
>>> instead of a str and array of str
>>
>> Correct.
>>
>>> 4.  Taking the naive approach to converting a bytes object to a str does not
>>> work as you would expect.
>>>
>>>>>> import urllib.request
>>>>>> page = urllib.request.urlopen('http://knuth.luther.edu/test.html')
>>>>>> page
>>> <addinfourl at 16419792 whose fp = <socket.SocketIO object at 0xfa8570>>
>>>>>> line = page.readline()
>>>>>> line
>>> b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
>>>>>> str(line)
>>> 'b\'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\\n\''
>>> As you can see from the example the 'b' becomes part of the string!  It
>>> seems like this should be a bug, is it?
>>>
>>
>> No because you are getting back the repr for the bytes object. Str
>> does not know what the encoding is for the bytes so it has no way of
>> performing the decoding.
>
> The encoding information *is* available in the response headers, e.g.:
>
> ---------------------- %< ---------------------------------
> $ wget -S --spider http://knuth.luther.edu/test.html
> --18:46:24--  http://knuth.luther.edu/test.html
>           => `test.html'
> Resolving knuth.luther.edu... 192.203.196.71
> Connecting to knuth.luther.edu|192.203.196.71|:80... connected.
> HTTP request sent, awaiting response...
>  HTTP/1.1 200 OK
>  Date: Tue, 03 Feb 2009 23:46:28 GMT
>  Server: Apache/2.0.50 (Linux/SUSE)
>  Last-Modified: Mon, 17 Sep 2007 23:35:49 GMT
>  ETag: "2fcd8-1d8-43b2bf40"
>  Accept-Ranges: bytes
>  Content-Length: 472
>  Keep-Alive: timeout=15, max=100
>  Connection: Keep-Alive
>  Content-Type: text/html; charset=ISO-8859-1
> Length: 472 [text/html]
> 200 OK
> ---------------------- %< ---------------------------------
>

Right, but he was asking why passing bytes to str() without an
encoding returns the repr.
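
To make the difference concrete, here is a small sketch using the line
from Brad's session. The iso-8859-1 encoding is taken from the
Content-Type header in the wget output quoted above; the variable names
are only for illustration.

```python
# The first line of the page, as returned by page.readline().
line = b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'

# With no encoding argument, str() cannot decode, so it falls back to
# the repr of the bytes object (hence the leading b and the quotes).
as_repr = str(line)

# With an explicit encoding, both spellings decode the bytes to text.
as_text = str(line, 'iso-8859-1')
same_text = line.decode('iso-8859-1')

print(as_repr)
print(as_text)
```

So str(line) is not broken, it is just the wrong tool unless the
encoding is passed along.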

> So, the OP's use case *could* be satisfied, assuming that the Py3K
> version of urllib sprouted a means of leveraging that header.  In this
> sense, fetching the resource over HTTP is *better* than loading it from
> a file:  information about the character set is explicit, and highly
> likely to be correct, at least for any resource people expect to render
> cleanly in a browser.

Right. And even if the header lacks the info (Content-Type is not
guaranteed to contain the charset), there is also the chance that the
document itself says, e.g. in an HTML meta tag or an XML declaration.

But as Bill pointed out, urllib just fetches data via HTTP, so a
character encoding will not always be meaningful (the payload may not
be text at all). The best solution would be to provide something in
the html package that can take what urllib.request.urlopen returns and
handle the decoding.
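
Purely as a sketch of what such a helper might look like (nothing like
this exists in the stdlib as far as this thread goes): decode_body and
charset_from_header are hypothetical names, and the utf-8 default and
the 1024-byte sniffing window are assumptions. It takes the raw bytes
plus the Content-Type header value, and leaves open how those are
pulled off the object urlopen returns.

```python
import re
from email.message import Message

# Loose pattern for a charset declared in the document itself, e.g.
# <meta ... charset=ISO-8859-1"> or <?xml ... encoding is NOT covered.
_META_CHARSET = re.compile(
    rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()  # lowercased name, or None


def decode_body(body, content_type=None, default='utf-8'):
    """Decode response bytes using the header, then the markup, then a default."""
    charset = charset_from_header(content_type)
    if charset is None:
        # Header had no charset: look near the top of the document.
        match = _META_CHARSET.search(body[:1024])
        if match:
            charset = match.group(1).decode('ascii')
    try:
        return body.decode(charset or default, errors='replace')
    except LookupError:
        # The declared charset is not a codec Python knows about.
        return body.decode(default, errors='replace')


# Header value copied from the wget output above; the body is made up.
text = decode_body(b'caf\xe9', 'text/html; charset=ISO-8859-1')
print(text)
```

The header wins over the markup here because, as Tres says, it is
explicit; a real implementation would have to decide what to do for
non-text content types.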

-Brett