urllib2: non-ascii characters in url

Wed Jun 18 08:23:47 EDT 2003

Achim Domma wrote:

> I have a script which crawls webpages..... But now I
> have urls with french characters and get the following traceback:

>   File "D:\Python23\lib\urlparse.py", line 134, in urlunsplit
>     url = url + '?' + query
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 13:
> ordinal not in range(128)
> 
> As far as I understand these characters are not allowed in urls, so I
> wouldn't consider this a bug in urllib2, but in reality there are these
> characters and I have to handle them.

You're right, non-ASCII characters (code points above 127) are not allowed in
URLs. They should be escaped as %xx sequences, where xx is the hexadecimal
value of each byte of the character in some character set. See RFC 2396:

http://rfc.net/rfc2396.html

But the RFC doesn't specify which character set the character codes should be
in. Instead, it says that URI schemes, e.g. HTTP, should devise some method of
specifying which character encoding is in effect, or specify a default.

"  For original character sequences that contain non-ASCII characters,
   however, the situation is more difficult. Internet protocols that
   transmit octet sequences intended to represent character sequences
   are expected to provide some way of identifying the charset used, if
   there might be more than one [RFC2277].  However, there is currently
   no provision within the generic URI syntax to accomplish this
   identification. An individual URI scheme may require a single
   charset, define a default charset, or provide a way to indicate the
   charset used."

As far as I'm aware, there is no default charset for the HTTP scheme, although
iso-8859-1 seems to be a good choice.
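
To illustrate why the choice of charset matters, here is a minimal sketch. It
assumes a modern Python 3 interpreter and its urllib.parse.quote function (not
the Python 2.2 urllib used in the session further down), and it shows the same
character producing different escapes under two charsets:

```python
from urllib.parse import quote

ch = "\u00e9"  # e-acute, the 0xe9 byte from the traceback if read as iso-8859-1

# One byte in iso-8859-1, so one %xx escape.
latin1 = quote(ch, encoding="iso-8859-1")
# Two bytes in utf-8, so two %xx escapes.
utf8 = quote(ch, encoding="utf-8")

print(latin1)  # %E9
print(utf8)    # %C3%A9
```

A server that expects one form will generally not recognise the other, which
is why the sender and the receiver have to agree on the charset.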

So, if you're receiving URIs with French characters in them, you should escape
them to %xx character references, and then pass the result to the URI
retrieval function.

The functions urllib.quote and urllib.unquote serve exactly this purpose
(they escape a byte string byte for byte, without knowing its encoding, so the
input must already be in the charset you intend):-

Python 2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import urlparse
>>> import urllib
>>> def correcturl(u):
...     s, a, p, q, f = urlparse.urlsplit(u)
...     p = urllib.quote(p)
...     return urlparse.urlunsplit((s, a, p, q, f))
...
>>> correcturl("http://www.co.fr/vous/êtes/ici")
'http://www.co.fr/vous/%88tes/ici'
>>>

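(There is an e-circumflex in the above URL, which may not transport correctly to
everyone's email/newsreader. The %88 is the byte my console supplied for it,
presumably from its code page: e-circumflex is 0x88 in cp850. In iso-8859-1
the same character would be escaped as %EA.)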
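
For anyone on a current interpreter, a rough Python 3 equivalent might look
like the sketch below. The names correct_url and charset are made up for this
example, and iso-8859-1 as the default is only the guess discussed above. It
also escapes the query, since the traceback in the question arose while the
query was being joined, whereas the session above escapes only the path:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def correct_url(url, charset="iso-8859-1"):
    """Percent-escape non-ASCII characters in the path and query of url.

    correct_url and charset are hypothetical names for this sketch.
    """
    scheme, netloc, path, query, fragment = urlsplit(url)
    # Leave '/' and any existing '%' escapes alone so nothing is escaped twice.
    path = quote(path, safe="/%", encoding=charset)
    # Leave the usual query delimiters alone as well.
    query = quote(query, safe="=&+%/;:,", encoding=charset)
    return urlunsplit((scheme, netloc, path, query, fragment))


fixed = correct_url("http://www.co.fr/vous/\u00eates/ici?nom=caf\u00e9")
print(fixed)  # http://www.co.fr/vous/%EAtes/ici?nom=caf%E9
```

Because the escapes follow the chosen charset rather than the console's code
page, the e-circumflex comes out as %EA here. Keeping '%' in the safe set is a
design choice: it preserves URLs that are already escaped, at the cost of not
escaping a literal '%' that was meant as data.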

Does that solve the problem?

P.S. RFC 2396 is worth a read sometime. It's fairly easy reading, and will help
you understand some of the complexities behind URIs, potentially the most
commonly used, and thus most successful, identifier scheme ever.

HTH,

-- 
alan kennedy
-----------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan:              http://xhaus.com/mailto/alan