[Python-Dev] bytes.from_hex()

Thu Mar 2 06:16:52 CET 2006

Ron Adam wrote:

> 1. We can specify the operation and not be sure of the resulting type.
> 
>    *or*
> 
> 2. We can specify the type and not always be sure of the operation.
> 
> maybe there's a way to specify both so it's unambiguous?

Here's another take on the matter. When we're doing
Unicode encoding or decoding, we're performing a
type conversion. The natural way to write a type
conversion in Python is with a constructor. But we
can't just say

   u = unicode(b)

because that doesn't give enough information. We want
to say that b is really of type, e.g., "bytes containing
utf8-encoded text":

   u = unicode(b, 'utf8')

Here we're not thinking of the 'utf8' as selecting an
encoder or decoder, but as giving extra information
about the type of b that isn't carried by b itself.
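
For readers trying this on Python 3, where str plays the
role of unicode, the two-argument constructor can be run
as written. A minimal sketch, with a sample value chosen
purely for illustration:

```python
# Python 3: str is the text type that this post calls unicode.
b = b'caf\xc3\xa9'        # bytes that happen to hold utf8-encoded text

# The second argument says how to interpret b; b itself carries no such tag.
u = str(b, 'utf8')
assert u == 'caf\u00e9'

# Without that extra information the constructor cannot decode anything;
# it just returns the printable representation of the bytes object.
assert str(b) == "b'caf\\xc3\\xa9'"
```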

Now, going in the other direction, we might think to
write

   b = bytes(u, 'utf8')

But that wouldn't be right, because if we interpret this
consistently it would mean we're saying that u contains
utf8-encoded information, which is nonsense. What we
need is a way of saying "construct me something of type
'bytes containing utf8-encoded text'":

   b = bytes['utf8'](u)

Here I've coined the notation t[enc], which
evaluates to a callable object that constructs an
object of type t by encoding its argument according
to enc.
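
The t[enc] notation is not something Python provides, but
it can be approximated. The following is only a sketch,
assuming Python 3 (str in place of unicode); the names
EncodingIndexable and Bytes are hypothetical and exist
just for this illustration:

```python
class EncodingIndexable(type):
    """Hypothetical metaclass: makes T[enc] evaluate to a constructor."""

    def __getitem__(cls, enc):
        def construct(text):
            # Build an instance of cls by encoding text according to enc.
            return cls(text.encode(enc))
        return construct


class Bytes(bytes, metaclass=EncodingIndexable):
    """Hypothetical bytes subclass that supports the Bytes[enc] notation."""


u = 'caf\u00e9'
b = Bytes['utf8'](u)      # "construct me bytes containing utf8-encoded text"
assert b == b'caf\xc3\xa9'
```

Subscripting the class goes through the metaclass, so
Bytes['utf8'] yields a plain callable and no instance is
created until that callable is applied to u.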

Now let's consider base64. Here, the roles of bytes
and unicode are reversed, because the bytes are just
bytes without any further interpretation, whereas
the unicode is really "unicode containing base64-encoded
data". So we write

   u = unicode['base64'](b)   # encoding

   b = bytes(u, 'base64')     # decoding

Note that this scheme is reasonably idiot-proof, e.g.

   u = unicode(b, 'base64')

results in a type error, because this specifies
a decoding operation, and the base64 decoder takes
text as input, not bytes.
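
The type checking described here can be sketched with two
small functions. This assumes Python 3 (str in place of
unicode) and the standard base64 module; encode_base64 and
decode_base64 are hypothetical stand-ins for
unicode['base64'] and the decoding constructor:

```python
import base64


def encode_base64(b):
    """Stand-in for unicode['base64'](b): plain bytes in, base64 text out."""
    if not isinstance(b, bytes):
        raise TypeError('base64 encoder takes bytes, not ' + type(b).__name__)
    return base64.b64encode(b).decode('ascii')


def decode_base64(u):
    """Stand-in for bytes(u, 'base64'): base64 text in, plain bytes out."""
    if not isinstance(u, str):
        raise TypeError('base64 decoder takes text, not ' + type(u).__name__)
    return base64.b64decode(u)


raw = b'abc'
u = encode_base64(raw)
assert u == 'YWJj'
assert decode_base64(u) == raw

# The mistaken call unicode(b, 'base64') amounts to decoding bytes:
try:
    decode_base64(raw)
except TypeError as exc:
    error = str(exc)
```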

What happens with transformations where the input and
output types are the same? In this scheme, they're
not really the same any more, because we're providing
extra type information. Suppose we had a code called
'piglatin' which goes from unicode to unicode. The
types involved are really "text" and "piglatin-encoded
text", so we write

   u2 = unicode['piglatin'](u1)   # encoding

   u1 = unicode(u2, 'piglatin')   # decoding

Here you won't get any type error if you get things
backwards, but there's not much that can be done
about that. You just have to keep straight which
of your strings contain piglatin and which don't.
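
To see why no type error is possible here, consider a toy
version of the hypothetical 'piglatin' code. The rule used
below (move the first letter of each word to the end and
append "ay") is an assumption made only for this sketch,
as are the function names:

```python
def piglatin_encode(text):
    """Toy rule: 'hello' becomes 'ellohay'. Text in, text out."""
    return ' '.join(w[1:] + w[0] + 'ay' for w in text.split())


def piglatin_decode(text):
    """Inverse of the toy rule. Also text in, text out."""
    return ' '.join(w[-3] + w[:-3] for w in text.split())


u1 = 'pig latin'
u2 = piglatin_encode(u1)
assert u2 == 'igpay atinlay'
assert piglatin_decode(u2) == u1

# Getting it backwards: decoding text that was never encoded raises
# nothing, because both sides are ordinary strings. It just yields garbage.
assert piglatin_decode('hello') == 'lhe'
```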

Is this scheme any better than having encode and
decode methods/functions? I'm not sure, but it
shows that a suitably enhanced notion of "data
type" can be used to replace the notions of
encoding and decoding and maybe reduce potential
confusion about which direction is which.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | Carpe post meridiam!                 |
Christchurch, New Zealand          | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz     +--------------------------------------+