[Python-Dev] bytes.from_hex()

Stephen J. Turnbull stephen at xemacs.org
Thu Mar 2 18:51:19 CET 2006


>>>>> "Greg" == Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

    Greg> But the base64 string itself *does* have text semantics.

What do you mean by that?  The strings of abstract "characters"
defined by RFC 3548 cannot be concatenated in general, they may only
be split at 4-character intervals, they can't be reliably searched as
text for a given octet or substring of the underlying binary object,
and deletion or insertion of octets can't be done without decoding and
re-encoding the whole string.  And of course humans can make neither
head nor tail of them in most cases.  The only useful semantics they
have is "you can apply the base64 decoder to them."
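
A minimal sketch of those restrictions, using the standard base64
module in its modern Python 3 spelling (not the codec API under
discussion in this thread).  The sample byte strings and variable
names are made up for illustration:

```python
import base64
import binascii

a, b = b"hello", b"world!"
enc_a = base64.b64encode(a)   # b'aGVsbG8='
enc_b = base64.b64encode(b)   # b'd29ybGQh'

# Concatenating two encodings is not the encoding of the concatenation.
concat_of_encodings = enc_a + enc_b
encoding_of_concat = base64.b64encode(a + b)
concat_differs = concat_of_encodings != encoding_of_concat

# Splitting is only safe at 4-character boundaries.
first_group = base64.b64decode(enc_b[:4])   # decodes the first 3 octets
try:
    base64.b64decode(enc_b[:5])
    odd_split_failed = False
except binascii.Error:
    odd_split_failed = True

# Searching the encoded form for an encoded substring is unreliable,
# because the substring need not start on a 3-octet boundary.
needle = base64.b64encode(b"world")
found_in_encoded = needle in encoding_of_concat
found_in_decoded = b"world" in base64.b64decode(encoding_of_concat)
```

In this sample the search succeeds only after decoding, which is the
point: the encoded form supports decoding and little else.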

In other words, by far the most important effect of endowing that
string with "text semantics" is to force programmers to remember not
to use them.

Do you really mean to call that "text semantics"?

    Greg> To me this is no different than using a string of decimal
    Greg> digit characters to represent an integer, or a string of
    Greg> hexadecimal digit characters to represent a bit
    Greg> pattern. Would you say that those are not text, either?

"No different"?  OK, I'll take you at your word.<wink>

T2YgY291cnNlIEkgd291bGQgY29uc2lkZXIgdGhvc2UgdGV4dC4gIFRoZXkncmUgaHVtYW4t
cmVhZGFibGUu

    Greg> What about XML? What would you consider the proper data type
    Greg> for an XML document to be inside a Python program -- bytes
    Greg> or text?

Neither.  If I must choose one of those ... well, "I know I have a
choice of programming languages, and I won't be using Python for this
task."  Fortunately, there's ElementTree.

What you presumably meant was "what would you consider the proper type
for (P)CDATA?"  And my answer is "text" for text, and "bytes" for
binary data (e.g., image or audio).  Let ElementTree handle the wire
format: if an Element's text attribute has type "bytes", convert to
base64 and then to the appropriate coded character set for the
channel.  I don't wanna know about the content transfer encoding, and
I should have no need to.
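
A rough sketch of that division of labor, assuming a pair of
hypothetical helpers (set_payload and get_payload) and a made-up
"encoding" marker attribute.  ElementTree itself does not accept bytes
as element text, so this only illustrates where the base64 step could
live, not an existing API:

```python
import base64
import xml.etree.ElementTree as ET

def set_payload(elem, payload):
    """Hypothetical helper: store str as text, bytes as base64 text."""
    if isinstance(payload, bytes):
        elem.text = base64.b64encode(payload).decode("ascii")
        elem.set("encoding", "base64")   # made-up marker attribute
    else:
        elem.text = payload

def get_payload(elem):
    """Hypothetical inverse: give the application back str or bytes."""
    if elem.get("encoding") == "base64":
        return base64.b64decode(elem.text or "")
    return elem.text

root = ET.Element("message")
set_payload(ET.SubElement(root, "caption"), "a small icon")
set_payload(ET.SubElement(root, "icon"), b"\x89PNG\r\n\x1a\n")

# Serialization picks the coded character set for the channel.
wire = ET.tostring(root, encoding="utf-8")
parsed = ET.fromstring(wire)
```

The application code only ever sees str for the caption and bytes for
the icon; the base64 form exists solely between tostring() and
fromstring().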

    Greg> You seem to want to reserve the term "text" for data that
    Greg> doesn't ever have to be understood even a little bit by a
    Greg> computer program, but that seems far too restrictive to me,
    Greg> and a long way from established usage.

What I want to reserve "text" for is data streams that nonprogrammer
humans might want to manipulate with pencil, paper, scissors, and
paste, or programmers with re and text[n:m] = text2.  I have no
objection to computers using it, too, and even asking us humans to
respect some restrictions on the use of [:]= and +.  But to tell us to
give up those operations entirely makes it into non-text IMO.

    Greg> [The] assumption [that the channel is ASCII-compatible] could
    Greg> be very wrong.  What happens if it turns out they really need
    Greg> to be encoded as UTF-16, or as EBCDIC?  All hell breaks
    Greg> loose, as far as I can see, unless the programmer has kept
    Greg> very firmly in mind that there is an implicit ASCII encoding
    Greg> involved.

    Greg> It's exactly to avoid the need for those kinds of mental
    Greg> gymnastics

Agreed, such bookkeeping would be annoying.  But there's no _need_ for
it any way you look at it: just leave binary objects as-is until
you're ready to put them on the wire.[1]  Attach a binary-to-wire codec
to this end of the wire, and inject your data there. This puts the
responsibility where it belongs: with the author of the wire driver.
That's the point, which you already mentioned: nobody but authors of
wire drivers[2] and introspective code will need to _explicitly_ call
.encode('base64').
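
One way such a wire driver might look, as a sketch only.  Base64Wire
is a hypothetical class, written with the modern base64 module rather
than .encode('base64'), and cp500 stands in for "some EBCDIC" channel:

```python
import base64
import io

class Base64Wire:
    """Hypothetical wire driver: takes bytes, emits base64 in the
    channel's coded character set."""

    def __init__(self, stream, charset="ascii"):
        self.stream = stream
        self.charset = charset

    def send(self, payload: bytes) -> None:
        # Step 1: octets -> abstract base64 characters.
        chars = base64.b64encode(payload).decode("ascii")
        # Step 2: characters -> whatever the channel requires.
        self.stream.write(chars.encode(self.charset))

payload = b"\x00\xffbinary"
ascii_channel = io.BytesIO()
ebcdic_channel = io.BytesIO()
Base64Wire(ascii_channel).send(payload)
Base64Wire(ebcdic_channel, charset="cp500").send(payload)
```

The caller hands over the same bytes object in both cases; only the
driver knows whether the channel is ASCII-compatible.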

    Greg> that Py3k will have a unified, encoding-agnostic data type
    Greg> for all character strings.

Yeah, but if base64 produces character strings, Unicode becomes a
unified, encoding-agnostic data type for all data.  Just base64
everything, and now we don't need a bytes type, right?

Note that this is precisely what Emacs/MULE does (with a variable
width non-Unicode internal encoding and "base256" instead of base64),
so as demented as it may sound, it's all too historically plausible.
And it can be implemented, by accident, at the application program
level.  Why expose our users to increased risk of such trouble?


Footnotes: 
[1]  Of course you may want to manipulate the binary data, even as
text.  But who's going to use the base64 format for that purpose?

[2]  I mean to include those who are writing the git.object_id(),
PGP_key.fingerprint(), and ElementTree.write() methods.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

