URLs and ampersands

Tue Aug 5 13:06:58 EDT 2008

Steven D'Aprano wrote:
> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a 
> snag with URLs containing ampersands:
>
> http://www.example.com/parrot.php?x=1&y=2
>
> Somewhere in the process, urls like the above are escaped to:
>
> http://www.example.com/parrot.php?x=1&amp;y=2
>
> which naturally fails to exist.
>
> I could just do a string replace, but is there a "right" way to escape 
> and unescape URLs? I've looked through the standard lib, but I can't find 
> anything helpful.

I don't believe there is a concept of 'escaping a URL' as such. How you
escape or unescape a URL depends on what context you're embedding it in
or extracting it from.

In this case, it looks like you have URLs which have been escaped to go
into an HTML CDATA attribute value (such as <a href="...">).
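To illustrate how the escaping depends on context, here is a minimal
sketch, assuming Python 3 (the html and urllib.parse modules). The
link text "parrot" is made up for the example. Building the query
string and embedding the finished URL in an attribute are two separate
layers of escaping:

```python
from html import escape
from urllib.parse import urlencode

# Layer 1: URL context. urlencode() percent-encodes the values and
# joins the pairs with a literal ampersand.
query = urlencode({"x": 1, "y": 2})
url = "http://www.example.com/parrot.php?" + query

# Layer 2: HTML attribute context. The ampersand has to become &amp;
# before the URL can sit inside href="...".
anchor = '<a href="%s">parrot</a>' % escape(url, quote=True)

print(url)
print(anchor)
```

The URL that should be fetched is the first printed value; the second
is only valid inside an HTML document.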

I believe there is no documented function in the Python standard library
which reverses this escaping (short of putting your string into a
larger document and parsing that with a full HTML or XML parser).
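
A note for later Python versions: the html module gained a documented
unescape() function in Python 3.4, well after this message was
written. The sketch below assumes Python 3.4 or later and shows both
that function and the parser route mentioned above. HrefCollector is a
hypothetical name used only for illustration:

```python
from html import unescape
from html.parser import HTMLParser

escaped = "http://www.example.com/parrot.php?x=1&amp;y=2"

# Direct route: reverse the HTML escaping on the bare string.
url = unescape(escaped)


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags in a parsed document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser passes attribute values with entity references
        # already replaced, so no extra unescaping is needed here.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Parser route: wrap the string in a small document and parse it.
collector = HrefCollector()
collector.feed('<a href="%s">parrot</a>' % escaped)
collector.close()

print(url)
print(collector.hrefs)
```

Either route should give a URL that is safe to hand to the download
call; a plain string replace of "&amp;" would miss other entities.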

-M-