[Tutor] urllib confusion

Steven D'Aprano steve at pearwood.info
Sat Nov 22 13:15:35 CET 2014


On Fri, Nov 21, 2014 at 01:37:45PM -0800, Clayton Kirkwood wrote:

> Got a general problem with url work. I've struggled through a lot of code
> which uses urllib.[parse,request]* and urllib2. First q: I read someplace in
> urllib documentation which makes it sound like either urllib or urllib2
> modules are being deprecated in 3.5. Don't know if it's only part or whole.

Can you point us to this place? I would be shocked and rather dismayed 
to hear that urllib(2) was being deprecated, but it is possible that one 
small component is being renamed/moved/deprecated.

> I've read through a lot that says that urllib..urlopen needs urlencode,
> and/or encode('utf-8') for byte conversion, but I've seen plenty of examples
> where nothing is being encoded either way. I also have a sneeking suspicious
> that urllib2 code does all of the encoding. I've read that if things aren't
> encoded that I will get TypeError, yet I've seen plenty of examples where
> there is no error and no encoding.

It's hard to comment on things you've read when we don't know what they 
are or precisely what they say. "I read that..." is the equivalent of "a 
man down the pub told me...".

If the examples are all ASCII, then no charset encoding is 
needed, although urlencode will still perform percent-encoding:

py> from urllib.parse import urlencode
py> urlencode({"key": "<value>"})
'key=%3Cvalue%3E'

The characters '<' and '>' are not legal inside URLs, so they have to be 
percent-encoded as '%3C' and '%3E'. Because all the characters are ASCII, 
no charset encoding step is involved, and everything else in the string 
is left untouched.

Non-ASCII characters, on the other hand, are encoded into UTF-8 by 
default, although you can pick another encoding and/or error handler:

py> urlencode({"key": "© 2014"})
'key=%C2%A9+2014'

The copyright symbol © encoded into UTF-8 is the two bytes 
\xC2\xA9, which are then percent-encoded into %C2%A9. The space is 
encoded as '+'.
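
To illustrate picking another encoding or error handler, here is a small 
sketch. The encoding and errors keyword arguments belong to urlencode in 
Python 3; the choice of Latin-1 and ASCII below is only mine, for 
demonstration:

```python
from urllib.parse import urlencode, parse_qs

params = {"key": "© 2014"}

# Default: UTF-8, so the copyright sign becomes two percent-encoded bytes.
utf8_query = urlencode(params)

# Latin-1 represents the copyright sign as the single byte 0xA9.
latin1_query = urlencode(params, encoding="latin-1")

# With a charset that cannot represent the character, the error handler
# decides what happens; "replace" substitutes "?", which is then
# percent-encoded as %3F.
ascii_query = urlencode(params, encoding="ascii", errors="replace")

print(utf8_query)    # key=%C2%A9+2014
print(latin1_query)  # key=%A9+2014
print(ascii_query)   # key=%3F+2014

# Decoding reverses both steps (UTF-8 is the default here too).
print(parse_qs(utf8_query))  # {'key': ['© 2014']}
```

Whatever encoding you pick, the receiving end has to decode with the 
same one, which is a good reason to stay with the UTF-8 default unless 
you know the server expects something else.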


> Why do so many examples seem to not encode? And not get TypeError? And yes,
> for those of you who are about to suggest it, I have tried a lot of things
> and read for many hours.

One actual example is worth about a thousand vague descriptions.

But in general, I would expect the urllib functions to default to 
UTF-8, so you don't have to specify an encoding manually; it just 
works.
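
My guess, without seeing your actual code, is that the TypeError you 
read about concerns POST data in Python 3: urlencode returns a str, but 
urlopen wants the data argument as bytes, and complains with a TypeError 
if it is given a str. Examples that only put the query string into the 
URL (a GET request), or that were written for Python 2's urllib2, never 
hit this. A minimal sketch, assuming Python 3 and a placeholder URL 
(example.com is not a real endpoint for this, and nothing is sent over 
the network here):

```python
from urllib.parse import urlencode
from urllib.request import Request

# urlencode handles the charset and percent-encoding, but returns text.
query = urlencode({"key": "© 2014"})
print(type(query).__name__)  # str

# The result is pure ASCII, so turning it into bytes is a trivial step.
data = query.encode("ascii")
print(data)  # b'key=%C2%A9+2014'

# Hypothetical URL, for illustration only. Supplying data makes this a
# POST request; the request is built here but never opened.
req = Request("http://example.com/submit", data=data)
print(req.get_method())  # POST

# For a GET request the str query simply goes into the URL itself,
# and no bytes conversion is needed.
get_url = "http://example.com/search?" + query
print(get_url)
```

Since the output of urlencode is already plain ASCII, encode('ascii') and 
encode('utf-8') give identical bytes at this point, which may be why the 
examples you found disagree about which one to use.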


-- 
Steven

