[Python-Dev] teaching the new urllib

Tue Feb 3 20:10:12 CET 2009

I'm just getting ready to start the semester using my new book (Python
Programming in Context) and noticed that I somehow missed all the changes to
urllib in Python 3.0.  ARGH, to say the least.  I like using urllib in the
intro class because we can get data from places that are more
interesting/motivating/relevant to the students.
Here are some of my observations on trying to do very basic stuff with
urllib:

1.  urllib.urlopen is now urllib.request.urlopen.
2.  The object returned by urlopen is no longer iterable!  No more
"for line in url".
3.  read, readline, and readlines now return a bytes object or a list of
bytes objects instead of a str or a list of str.
4.  Taking the naive approach to converting a bytes object to a str does not
work as you would expect.

>>> import urllib.request
>>> page = urllib.request.urlopen('http://knuth.luther.edu/test.html')
>>> page
<addinfourl at 16419792 whose fp = <socket.SocketIO object at 0xfa8570>>
>>> line = page.readline()
>>> line
b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
>>> str(line)
'b\'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\\n\''
>>>

As you can see from the example, the 'b' becomes part of the string!  It
seems like this should be a bug; is it?
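
For comparison, here is a minimal sketch of what I would have expected to
write instead.  The literal below just stands in for the line returned by
readline(), and the UTF-8 encoding is my assumption, not something read from
the response headers.

```python
# Stand-in for the bytes object returned by page.readline() above.
line = b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'

# Assumption: the page is UTF-8 (or plain ASCII) encoded.
text = line.decode('utf-8')

# str() with an explicit encoding also decodes instead of taking the repr.
same_text = str(line, 'utf-8')

print(repr(text))
```

Having to name an encoding for every line is still a lot to explain in the
first weeks of an intro class.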

Here's the iteration problem:
>>> for line in page:
    print(line)

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    for line in page:
TypeError: 'addinfourl' object is not iterable

Why is this not iterable anymore?  Is this a bug too?  What the heck is an
addinfourl object?
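
A workaround sketch that relies only on readline(), which the returned object
does still have.  So that it runs without a network, io.BytesIO stands in for
the object returned by urlopen, and the two-line page content is made up for
illustration; the UTF-8 decoding is again my assumption.

```python
import io

# Hypothetical stand-in for the object returned by urlopen();
# only its readline() method is used below.
page = io.BytesIO(b'<html>\n<body>hello</body>\n')

lines = []
# iter(callable, sentinel) keeps calling page.readline() until it
# returns b'', which signals the end of the data.
for raw in iter(page.readline, b''):
    lines.append(raw.decode('utf-8'))  # assumption: UTF-8 content

print(lines)
```

This works, but it is a lot less obvious to a beginner than "for line in
page".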

5.  Finally, I see that a bytes object has some of the same methods as
strings.  But the error messages are confusing.

>>> line
b'   "http://www.w3.org/TR/html4/loose.dtd">\n'
>>> line.find('www')
Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    line.find('www')
TypeError: expected an object with the buffer interface
>>> line.find(b'www')
11

Why couldn't find take a string as a parameter?
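
As far as I can tell there are two ways to spell this search, sketched below
with the same line as a literal: search the bytes with a bytes argument, or
decode first and search the resulting str.  The ASCII decoding is my
assumption about this particular page.

```python
# Same line as in the session above, as a bytes literal.
line = b'   "http://www.w3.org/TR/html4/loose.dtd">\n'

# Option 1: stay in bytes and pass a bytes argument.
pos_bytes = line.find(b'www')

# Option 2: decode first (assumption: ASCII-compatible content),
# then use the ordinary str method with a str argument.
pos_str = line.decode('ascii').find('www')

print(pos_bytes, pos_str)
```

Both give the same index here because every character in this line is a
single byte.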

If folks have advice on which, if any, of these are bugs, please let me know
and I'll file them and, if possible, work on fixes for them too.

If you have advice on how I could better teach this new urllib, that
would be great to hear as well.

Thanks,

Brad

-- 
Brad Miller
Assistant Professor, Computer Science
Luther College
