URLs and ampersands

Tue Aug 5 13:04:01 EDT 2008

On Tue, 05 Aug 2008 06:59:20 -0300, Steven D'Aprano <steven at remove.this.cybersource.com.au> wrote:

> On Mon, 04 Aug 2008 23:16:46 -0300, Gabriel Genellina wrote:
>
>> On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano
>> <steve at REMOVE-THIS-cybersource.com.au> wrote:
>>
>>> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
>>> snag with URLs containing ampersands:
>>>
>>> http://www.example.com/parrot.php?x=1&y=2
>>>
>>> Somewhere in the process, urls like the above are escaped to:
>>>
>>> http://www.example.com/parrot.php?x=1&amp;y=2
>>>
>>> which naturally fails to exist.
>>>
>>> I could just do a string replace, but is there a "right" way to escape
>>> and unescape URLs? I've looked through the standard lib, but I can't
>>> find anything helpful.
>>
>> This works fine for me:
>>
>> py> import urllib
>> py> fn = urllib.urlretrieve("http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
>> py> open(fn,"rb").read()
>> '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...
>>
>> So it's not urlretrieve escaping the url, but something else in your
>> code...
>
> I didn't say urlretrieve was escaping the URL. I actually think the
> URLs are pre-escaped when I scrape them from an HTML file.

(Ok, you didn't even mention you were scraping HTML pages...)

> I have searched
> for, but been unable to find, standard library functions that escape or
> unescape URLs. Are there any such functions?

Yes: cgi.escape (escaping only; the cgi module has no unescape counterpart), and xml.sax.saxutils.escape/unescape.
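
A minimal sketch of the xml.sax.saxutils functions, assuming the scraped link is the example URL from earlier in the thread with its ampersand written as the HTML entity &amp; (the variable names are only illustrative):

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical value, as it would appear inside an HTML attribute.
scraped = "http://www.example.com/parrot.php?x=1&amp;y=2"

# unescape() turns &amp;, &lt; and &gt; back into &, < and >.
url = unescape(scraped)
print(url)

# escape() goes the other way, e.g. when writing the URL back into HTML.
print(escape(url) == scraped)
```

By default only those three entities are handled; anything else, such as &quot;, has to be passed in the optional entities dictionary, e.g. unescape(s, {"&quot;": '"'}).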
How are you scraping the HTML source? Both BeautifulSoup and ElementTree.HTMLTreeBuilder already do that work for you.
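
To illustrate what "do that work for you" means, here is a rough sketch using the standard library's HTMLParser instead of either of the two libraries named above (so this is a stand-in, not their API). The LinkCollector class and the sample page are made up for the example. The parser hands over attribute values with entity references already replaced:

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values arrive unescaped.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up page fragment with the ampersand escaped, as valid HTML requires.
page = '<a href="http://www.example.com/parrot.php?x=1&amp;y=2">parrot</a>'

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)
```

The collected URL contains a plain & and can be passed straight to urllib.urlretrieve() without any string replacement.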

-- 
Gabriel Genellina