[Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

Thomas Wouters thomas at xs4all.net
Sat Feb 18 23:33:15 CET 2006


On Sat, Feb 18, 2006 at 01:21:18PM +0100, M.-A. Lemburg wrote:

> It's by no means a Perl attitude.

In your eyes, perhaps. It certainly feels that way to me (or I wouldn't have
said it :). Perl happens to be full of general constructs that were added
because they were easy to add, or because they were useful in edge cases. The
encode/decode methods remind me of that, even though I fully understand the
reasoning behind them, and the elegance of the implementation.

> The main reason is symmetry and the fact that strings and Unicode
> should be as similar as possible in order to simplify the task of
> moving from one to the other.

Yes, and this is a design choice I don't agree with. They're different
types. They do everything similarly, except when they are mixed together
(unicode takes precedence; in general, the bytestring is decoded using the
default encoding). Going from one to the other isn't symmetric, though. I
understand that you disagree; the disagreement is on the fundamental choice
of allowing 'encode' and 'decode' to do *more* than going from and to
unicode. I regret that decision, not the decision to make encode and decode
symmetric (which makes sense, once the decision to overgeneralize
encode/decode has been made).
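
To make the asymmetry concrete, a quick illustrative session (Python 2,
with the default ASCII encoding assumed):

    >>> 'abc' + u'def'   # the bytestring is implicitly decoded to unicode
    u'abcdef'
    >>> '\xff' + u'def'  # non-ASCII bytes make the implicit decode blow up
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)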

> >  - The return value for the non-unicode encodings depends on the value of
> >    the encoding argument.

> Not really: you'll always get a basestring instance.

Which is not a particularly useful distinction, since in any real-world
application you have to be careful not to mix unicode with (non-ASCII)
bytestrings. The only way to deal with unicode reliably is to keep it
well-contained (when migrating an application from bytestrings to unicode)
or to use unicode everywhere, decoding/encoding at the entry points.
Containment is hard to achieve.
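
The distinction is easy to demonstrate (again an illustrative Python 2
session):

    >>> isinstance('Hello'.encode('base64'), basestring)
    True
    >>> isinstance('Hello'.decode('utf-8'), basestring)
    True
    >>> type('Hello'.encode('base64')), type('Hello'.decode('utf-8'))
    (<type 'str'>, <type 'unicode'>)

Both results are basestring instances, but one is a str and the other a
unicode object, so you still have to know which one you're holding.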

> Still, I believe that this is an educational problem. There are
> a couple of gotchas users will have to be aware of (and this is
> unrelated to the methods in question):
> 
> * "encoding" always refers to transforming original data into
>   a derived form
> 
> * "decoding" always refers to transforming a derived form of
>   data back into its original form
> 
> * for Unicode codecs the original form is Unicode, the derived
>   form is, in most cases, a string
> 
> As a result, if you want to use a Unicode codec such as utf-8,
> you encode Unicode into a utf-8 string and decode a utf-8 string
> into Unicode.
> 
> Encoding a string is only possible if the string itself is
> original data, e.g. some data that is supposed to be transformed
> into a base64 encoded form.
> 
> Decoding Unicode is only possible if the Unicode string itself
> represents a derived form, e.g. a sequence of hex literals.
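
For concreteness, those rules play out like this in a Python 2 session:

    >>> u'Hello'.encode('utf-8')    # Unicode codec: unicode -> str
    'Hello'
    >>> 'Hello'.decode('utf-8')     # and back: str -> unicode
    u'Hello'
    >>> 'Hello'.encode('base64')    # string codec: the str is the original form
    'SGVsbG8=\n'
    >>> '48656c6c6f'.decode('hex')  # derived form back to the original bytes
    'Hello'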

Most of these gotchas would not have been gotchas had encode/decode only
been usable for unicode encodings.
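
For example (Python 2; the surprise is the implicit ASCII decode that
str.encode performs before applying a Unicode codec):

    >>> '\xff'.encode('utf-8')   # looks like str -> str, but decodes via ASCII first
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)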

> > That is why I disagree with the hypergeneralization of the encode/decode
> > methods
[..]
> That's because you only look at one specific task.

> Codecs also unify the various interfaces to common encodings
> such as base64, uu or zip which are not Unicode related.

No, I think you misunderstand. I object to the hypergeneralization of the
*encode/decode methods*, not of the codec system. I would have been fine with
a separate set of methods for the non-unicode transformations, although I
would have been even happier if they took their codec not as a string but as,
say, a module object, or something imported from a module.
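
For comparison, the standard library's base64 module already has roughly
that shape: the codec is named by the module you import rather than by a
string (Python 2.4 session):

    >>> import base64
    >>> base64.b64encode('Hello')    # no codec-name string anywhere
    'SGVsbG8='
    >>> base64.b64decode('SGVsbG8=')
    'Hello'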

Not that I think any of this matters; we have what we have and I'll have to
live with it ;)

-- 
Thomas Wouters <thomas at xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

