urllib.quote fails on Unicode URL
Peter Otten
__peter__ at web.de
Fri May 4 03:14:31 EDT 2007
John Nagle wrote:
> The code in urllib.quote fails on Unicode input, when
> called by robotparser.
>
> That bit of code needs some attention.
> - It still assumes ASCII goes up to 255, which hasn't been true in
> Python
> for a while now.
> - The initialization may not be thread-safe; a table is being
> initialized
> on first use. The code is too clever and uncommented.
>
> "robotparser" was trying to check if a URL,
>
"http://www.highbeam.com/DynamicContent/%E2%80%9D/mysaved/privacyPref.asp%22"
> could be accessed, and there are some wierd characters in there. Unicode
> URLs are legal, so this is a real bug.
>
> Logged in as Bug #1712522.
There has been a related discussion:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/b331dc3625dbfc41/ce6e6a3c0635e340
IIRC the outcome was that while UTF-8 is recommended
urllib.quote()/unquote() should not guess the encoding.
What changes that would imply for robotparser I don't know...
Peter
More information about the Python-list
mailing list