URLs and ampersands

Tue Aug 5 08:07:39 EDT 2008

Steven D'Aprano <steven at REMOVE.THIS.cybersource.com.au> wrote:

> I didn't say urlretrieve was escaping the URL. I actually think the
> URLs are pre-escaped when I scrape them from an HTML file. I have
> searched for, but been unable to find, standard library functions that
> escape or unescape URLs. Are there any such functions?
> 
Whenever you put a URL into an HTML file you need to escape it (an 
ampersand is written as &amp;, for example), so naturally you will also 
need to unescape it when it is retrieved from the file. However, whatever 
you use to parse the HTML ought to be unescaping text and attributes as 
part of the parsing process, so you shouldn't need a separate function for 
this.

e.g.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<a href="http://www.example.com/parrot.php?x=1&amp;y=2">link</a>''')
>>> soup.contents[0]['href']
u'http://www.example.com/parrot.php?x=1&y=2'
>>> 

Even the HTMLParser class in Python's standard library will do this for 
you. What parser are you using?
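
For the standard library route, here is a minimal sketch rather than a 
definitive implementation. It assumes Python 3, where the class lives in 
html.parser (the Python 2 module was named HTMLParser). The HrefCollector 
class and the sample page string are made up for illustration. The last 
lines also show html.unescape, available in Python 3.4 and later, for the 
case where you already hold the raw escaped text.

```python
from html import unescape           # standalone unescaping, Python 3.4+
from html.parser import HTMLParser  # Python 3 name; Python 2 used module HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. The parser has already
        # replaced character references such as &amp; in each value.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Illustrative input: the ampersand is escaped, as it should be in HTML.
page = '<a href="http://www.example.com/parrot.php?x=1&amp;y=2">link</a>'

collector = HrefCollector()
collector.feed(page)
collector.close()
print(collector.hrefs[0])

# If the escaped text was extracted some other way (a regex, say),
# unescape it explicitly instead.
raw = "http://www.example.com/parrot.php?x=1&amp;y=2"
print(unescape(raw))
```

Both print calls should show the URL with a plain & between x=1 and y=2. 
Letting the parser do the work is usually the safer choice, since it only 
unescapes where HTML says escaping applies.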

-- 
Duncan Booth http://kupuguy.blogspot.com