urllib.quote fails on Unicode URL

Peter Otten __peter__ at web.de
Fri May 4 03:14:31 EDT 2007


John Nagle wrote:

>     The code in urllib.quote fails on Unicode input, when
> called by robotparser.
> 
>     That bit of code needs some attention.
>     - It still assumes ASCII goes up to 255, which hasn't been true in
>     Python
>       for a while now.
>     - The initialization may not be thread-safe; a table is being
>     initialized
>       on first use.  The code is too clever and uncommented.
> 
> "robotparser" was trying to check if a URL,
>
"http://www.highbeam.com/DynamicContent/%E2%80%9D/mysaved/privacyPref.asp%22"
> could be accessed, and there are some wierd characters in there.  Unicode
> URLs are legal, so this is a real bug.
> 
> Logged in as Bug #1712522.

There has been a related discussion:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/b331dc3625dbfc41/ce6e6a3c0635e340

IIRC the outcome was that while UTF-8 is recommended
urllib.quote()/unquote() should not guess the encoding.

What changes that would imply for robotparser I don't know...

Peter



More information about the Python-list mailing list