Dr. Dobb's Python-URL! - weekly Python news and links (Dec 30)

Tue Jan 4 11:01:43 EST 2005

On Tue, 04 Jan 2005 16:41:05 +0100, Thomas Heller <theller at python.net> wrote:
>Skip Montanaro <skip at pobox.com> writes:
> 
> >     michele> BTW what's the difference between .encode and .decode ?
> >
> > I started to answer, then got confused when I read the docstrings for
> > unicode.encode and unicode.decode:
> >
> > [snip - docstrings]
> >
> > It probably makes sense to one who knows, but for the feeble-minded like
> > myself, they seem about the same.
> 
> It seems also the error messages aren't too helpful:
> 
> >>> "ä".encode("latin-1")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x84 in position 0: ordinal not in range(128)
> >>>
> 
> Hm, why does the 'encode' call complain about decoding?
> 
> Why do string objects have an encode method, and why do unicode objects
> have a decode method, and what does this error message want to tell me:
> 
> >>> u"ä".decode("latin-1")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
> >>>

   The call

       unicode.decode(codec)

   is actually doing this:

       unicode.encode(sys.getdefaultencoding()).decode(codec)
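
   To make the two steps visible, here is a rough sketch written for
Python 3, where the implicit step no longer exists and has to be spelled
out by hand.  The helper name decode_like_py2 and the DEFAULT_ENCODING
constant are made up for illustration, and "ascii" is assumed as the
default encoding, which matches the error messages quoted above.

```python
# Python 3 sketch; Python 2's implicit first step is written out explicitly.
DEFAULT_ENCODING = "ascii"  # assumption: the interpreter default at the time


def decode_like_py2(text, codec):
    """Hypothetical helper mimicking unicode.decode(codec) as described above."""
    # Step 1 (implicit in Python 2): unicode -> bytes via the default encoding.
    raw = text.encode(DEFAULT_ENCODING)
    # Step 2 (what the caller actually asked for): bytes -> unicode via `codec`.
    return raw.decode(codec)


# Pure ASCII text survives step 1, so the round trip works.
print(decode_like_py2("abc", "latin-1"))

try:
    decode_like_py2("\xe4", "latin-1")
except UnicodeEncodeError as exc:
    # Step 1 fails, which is why a "decode" call reports an *encode* error.
    print(type(exc).__name__, exc.encoding)
```

   The failure comes from the hidden encode step, not from the latin-1
decode that was requested.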

   This is not particularly nice behavior, and I'm not sure who thought
it was a good idea.  One possible reason is that .encode() and .decode()
are not _only_ for converting between unicode objects and encoded
bytestrings.  For example, there are the zlib and rot13 codecs, and
applications can define their own codecs with arbitrary behavior.  It's
entirely possible to write a codec that decodes _from_ unicode objects
_to_ unicode objects and encodes the same way, so unicode objects need
both methods to support this use case.
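
   As a small illustration of codecs that are not unicode <-> bytestring
conversions, here is a sketch for Python 3.  There the string methods
only accept text encodings, so the module-level codecs.encode() and
codecs.decode() functions are used instead; the sample strings are
arbitrary.

```python
import codecs

# rot13 maps text to text: both directions take a str and return a str.
secret = codecs.encode("Hello", "rot13")
print(secret)                          # Uryyb
print(codecs.decode(secret, "rot13"))  # Hello

# zlib maps bytes to bytes: again no unicode <-> bytestring conversion.
payload = b"spam" * 10
packed = codecs.encode(payload, "zlib")
print(codecs.decode(packed, "zlib") == payload)
```

   Neither codec changes the type of its input, which is the case the
paragraph above describes.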

  Jp