urllib.unquote and unicode

Tue Dec 19 08:24:25 EST 2006

Fredrik Lundh wrote:
> George Sakkis wrote:
>
> > The following snippet results in different outcome for (at least) the
> > last three major releases:
> >
> > >>> import urllib
> > >>> urllib.unquote(u'%94')
> >
> > # Python 2.3.4
> > u'%94'
> >
> > # Python 2.4.2
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
> > ordinal not in range(128)
> >
> > # Python 2.5
> > u'\x94'
> >
> > Is the current version the "right" one or is this function supposed to
> > change every other week ?
>
> why are you passing non-ASCII Unicode strings to a function designed for
> fixing up 8-bit strings in the first place?  if you do proper encoding
> before you quote things, it'll work the same way in all Python releases.

I'm using BeautifulSoup, which from version 3 returns Unicode only, and
I stumbled on a page with such bogus character escapes; I have the
impression that whatever generated it wrote the ord() value of each
reserved character after the '%' instead of the proper two-digit hex
representation in latin-1. If that's the case, unquote() won't do
anyway and I'd have to go with chr() (or unichr(), to stay in Unicode)
on the number part.
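
A rough sketch of the two readings, in case it helps anyone hitting the
same thing. The helper names (unquote_hex_latin1, unquote_decimal) are
made up for illustration and are not part of urllib, and the decimal
reading is only my guess at what the page generator did. The unichr
fallback is there so the same code also runs where chr() already
returns a Unicode character.

```python
import re

# Fallback so the sketch runs on interpreters that have no unichr().
try:
    unichr
except NameError:
    unichr = chr

# Hypothetical helpers for illustration; neither exists in urllib.
_HEX_ESCAPE = re.compile(r'%([0-9A-Fa-f]{2})')
_DEC_ESCAPE = re.compile(r'%([0-9]{2,3})')


def unquote_hex_latin1(text):
    """Read %XX as two hex digits and map it to the latin-1 character.

    latin-1 maps byte value N straight to code point N, so no separate
    decode step is needed.
    """
    return _HEX_ESCAPE.sub(lambda m: unichr(int(m.group(1), 16)), text)


def unquote_decimal(text):
    """Read %NN as a decimal ord() value (assumed generator behaviour)."""
    return _DEC_ESCAPE.sub(lambda m: unichr(int(m.group(1))), text)
```

With these, unquote_hex_latin1(u'%94') gives u'\x94' (the same as the
2.5 result above), while unquote_decimal(u'%94') gives u'^', so the two
readings are easy to tell apart on a sample of the page.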

George