urllib2: non-ascii characters in url

Achim Domma domma at procoders.net
Wed Jun 18 07:46:29 EDT 2003


Hi,

I have a script which crawls webpages. The pages are downloaded using
urllib2.urlopen and parsed with htmllib.HTMLParser. If I find a link, I make
it absolute by calling urljoin with the original url of the current document.
This works fine as long as all urls contain only ascii characters. But now I
have urls with French characters and get the following traceback:

[...]
    return urljoin(self.url,url)
  File "D:\Python23\lib\urlparse.py", line 188, in urljoin
    return urlunparse((scheme, netloc, '/'.join(segments),
  File "D:\Python23\lib\urlparse.py", line 125, in urlunparse
    return urlunsplit((scheme, netloc, url, query, fragment))
  File "D:\Python23\lib\urlparse.py", line 134, in urlunsplit
    url = url + '?' + query
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 13:
ordinal not in range(128)

As far as I understand, these characters are not allowed in urls, so I
wouldn't consider this a bug in urllib2, but in reality such urls exist
and I have to handle them.
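
One possible workaround is to percent-encode the offending characters before
the link is passed to urljoin. The following is only a minimal sketch under
several assumptions: it uses Python 3's urllib.parse (in Python 2.3 the
corresponding functions would be urllib.quote and urlparse.urljoin), it assumes
the links are Latin-1 encoded (byte 0xe9 in the traceback is an e-acute in
Latin-1), and the helper name quote_non_ascii, the example.com base url and the
sample link are made up for illustration.

```python
from urllib.parse import quote, urljoin

# Characters that already have a meaning in a url and are left untouched.
# '%' is included so that sequences that are already escaped are not
# escaped a second time.
SAFE_CHARS = "/:?#[]@!$&'()*+,;=%-._~"


def quote_non_ascii(url, encoding="latin-1"):
    """Percent-encode every character outside SAFE_CHARS and ASCII alphanumerics.

    The encoding is an assumption; it should match the encoding of the
    page the link was found on.
    """
    return quote(url, safe=SAFE_CHARS, encoding=encoding)


# Hypothetical base document and link, for illustration only.
base = "http://example.com/dir/index.html"
link = "caf\u00e9.html?nom=ren\u00e9"

absolute = urljoin(base, quote_non_ascii(link))
print(absolute)
```

Escaping the link before joining keeps urljoin working on plain ascii text.
Whether the server expects Latin-1 or UTF-8 bytes behind the escapes depends
on the site, so the encoding argument may need adjusting per page.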

Any idea how to solve this problem?

regards,
Achim





