Unescaping URLs in Python

John Nagle nagle at animats.com
Sun Dec 24 22:52:45 EST 2006


Here's a URL from a link on the home page of a major company.

	<a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>

Yes, that "&amp;" is in the source text of the page.

This is, in fact, correct HTML. See

	http://www.htmlhelp.com/tools/validator/problems.html#amp

     What's the appropriate Python function to call to unescape a URL which might
contain things like that?  Will this interfere with the usual "%" type escapes
in URLs?
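
As a rough sketch of an answer, assuming a current Python 3 (3.4 or later), where the standard library provides html.unescape: it decodes HTML character references such as "&amp;" and leaves "%" escapes alone, because those are not HTML entities. The second URL below is made up for illustration.

```python
from html import unescape

# href value as it appears in the page source, with the ampersand HTML-escaped.
href = "/adsk/servlet/index?siteID=123112&amp;id=1860142"
decoded = unescape(href)

# Made-up example: "%20" is a URL escape, not an HTML entity,
# so unescape() passes it through unchanged.
mixed = unescape("/search?q=a%20b&amp;page=2")

print(decoded)
print(mixed)
```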

     What's actually needed to get this right is something that goes from
HTML escaped form to URL escaped form, because, in general, there is no
unescaped form that will work for all URLs.
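
One way this conversion might look, again assuming a current Python 3: decode the HTML entities, then percent-encode whatever is still unsafe while leaving URL delimiters and existing "%" escapes untouched. The name html_attr_to_url is hypothetical, not a standard library function, and the choice of safe characters is an assumption.

```python
from html import unescape
from urllib.parse import quote

# Characters left as-is: URL delimiters plus "%", so that
# escapes already present in the URL are not encoded a second time.
SAFE = "/:?#[]@!$&'()*+,;=%"

def html_attr_to_url(href):
    """Hypothetical helper: HTML-escaped attribute value -> URL-escaped form."""
    raw = unescape(href)          # "&amp;" -> "&", "&eacute;" -> "é", ...
    return quote(raw, safe=SAFE)  # non-ASCII and spaces -> UTF-8 "%XX" escapes

print(html_attr_to_url("/adsk/servlet/index?siteID=123112&amp;id=1860142"))
print(html_attr_to_url("/caf&eacute;?q=a%20b&amp;x=1 2"))
```

Because "%" is treated as safe, a stray "%" that is not part of a valid escape is passed through unchanged; this sketch does not try to repair it.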

There's "htmldecode" at "http://zesty.ca/python/scrape.py", which works,
but this should be a standard library function.
				
				John Nagle


