"More About Unicode in Python 2 and 3"

Mon Jan 6 13:34:10 EST 2014

On 01/06/2014 09:27 AM, Steven D'Aprano wrote:
> Ethan Furman wrote:
>
> Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
> Text != "ascii data", and the fact that some people apparently think it
> does is pretty much the heart of the problem.

The heart of a different problem, not this one.  The problem I refer to is that many binary formats have well-defined 
ASCII-encoded text tidbits.  These tidbits were quite easy to work with in Py2, are not difficult but not elegant in Py3, 
and are even worse if you have to support both 2 and 3.

> Now, it is true that some of those bytes happen to fall into the same range
> of values as ASCII-encoded text. They may even represent text after
> decoding, but since we don't know what the file contents mean, we can't
> know that.

Of course we can -- we're the programmer, after all.  This is not a random bunch of bytes but a well-defined format for 
storing data.

> It might be a mere coincidence that the four bytes starting at
> hex offset 40 is the C long 1095189760 which happens to look like "AGE"
> with a null at the end. For historical reasons, your hexdump utility
> performs that decoding step for you, which is why you can see "NAME"
> and "AGE" in the right-hand block, but that doesn't mean the file contains
> text. It contains bytes, some of which represent text after decoding.

As it happens, 'NAME' and 'AGE' are encoded, and will be decoded.  They could just as easily have contained tildes, 
accents, umlauts, and other strange (to me) characters.  It's actually the 'C' and the 'N' that bug me (like I said, my 
example is minimal, especially compared to a network protocol).
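
For reference, the coincidence mentioned in the quoted text is easy to check.  This is only a sketch: it assumes the 
"C long" is a 4-byte big-endian signed integer, and the variable names are made up for illustration.

```python
import struct

raw = b'AGE\x00'                      # the four bytes from the quoted example

# Read the same four bytes as a number (assumed: 4-byte big-endian signed long).
as_long = struct.unpack('>l', raw)[0]

# Read them as a null-terminated text field instead.
as_text = raw.rstrip(b'\x00').decode('latin-1')

print(as_long)
print(as_text)
```

Which reading is right is decided by the format definition, not by the bytes themselves.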

And you're right -- it is easy to say FIELD_TYPE = slice(15,16), and it was also easy to say FIELD_TYPE = 15, but there 
is a critical difference -- can you spot it?

..
..
..
In case you didn't:  both work in Py2, but only the slice version works (correctly) in Py3.  The worst part is having to 
ask why I need a slice to take a single byte when a simple index should work.  Because the bytes type lies.  It shows, for 
example, b'\r\n\x12\x08N\x00' but when I try to access that N to see if this is a Numeric field I get:

--> b'\r\n\x12\x08N\x00'[4]
78

This is a cognitive dissonance that one does not expect in Python.
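
A minimal sketch of the index/slice difference on Python 3.  The offset 4 below applies only to the six bytes shown 
above; the names `record` and `TYPE_AT` are hypothetical and are not the real descriptor layout.

```python
record = b'\r\n\x12\x08N\x00'

# Indexing a bytes object on Python 3 returns an int, not a length-1 bytes.
by_index = record[4]

# Slicing returns bytes, so it still compares equal to a byte literal.
by_slice = record[4:5]

# The same thing with a named slice, in the spirit of FIELD_TYPE = slice(15,16).
TYPE_AT = slice(4, 5)
is_numeric = record[TYPE_AT] == b'N'

print(repr(by_index))
print(repr(by_slice))
print(is_numeric)
```

On Python 2 both the index and the slice give the one-character str 'N', which is why either spelling worked there.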

> If you (generic you) don't get that, you'll have a bad time. I mean *really*
> get it, deep down in the bone. The long, bad habit of thinking of
> ASCII-encoded bytes as text is the problem here.

Different problem.  The problem here is that a byte taken from a bytes object by index and the matching byte literal 
don't compare equal.

> the average programmer has equally many years of thinking that the
> byte 41 "just is" the letter "A", and that's simply *wrong*.

Agreed.  But byte 41 != b'A', and that is equally wrong.
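
To make the comparison concrete, here is a small Python 3 sketch.  "Byte 41" is read as hex 0x41; the variable names 
are only for illustration.

```python
byte_value = b'A'[0]                   # indexing gives the int 65 (0x41)

same_as_hex = byte_value == 0x41       # int compared with int
same_as_literal = byte_value == b'A'   # int compared with bytes

# Two ways to get a comparison that does what one expects:
via_ord = byte_value == ord('A')
via_bytes = bytes([byte_value]) == b'A'

print(same_as_hex, same_as_literal, via_ord, via_bytes)
```

Note that running Python 3 with the -b option makes the int/bytes comparison emit a BytesWarning instead of passing 
silently.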

>> As I said earlier, my
>> example is minimal, but still very frustrating in
>> that normal operations no longer work.  Incidentally, if you were thinking
>> that NAME and AGE were part of the ascii text, you'd be wrong -- the field
>> names are also encoded, as are the Character and Memo fields.
>
> What Character and Memo fields? Are you trying to say that the NAME and AGE
> are *not* actually ASCII text, but a mere coincidence, like my example of
> 1095189760? Or are you referring to the fact that they're actually encoded
> as ASCII? If not, I have no idea what you are trying to say.

Yes, NAME and AGE are *not* ASCII text, but latin-1 encoded.  The C and the N are ASCII, meaningful as-is.  The actual 
data stored in a Character (NAME in this case) or Memo (not shown) field would also be latin-1 encoded.  (And before you 
ask, the encoding is stored in the file header.)
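
A sketch of what decoding such a field name looks like on Python 3.  The null padding, the sample bytes, and the 
helper name `decode_field` are assumptions for illustration; in real code the encoding would come from the file header 
rather than being hard-coded.

```python
def decode_field(raw, encoding):
    """Strip trailing null padding (assumed) and decode with the given encoding."""
    return raw.rstrip(b'\x00').decode(encoding)

encoding = 'latin-1'                   # assumed to have been read from the header

plain = decode_field(b'NAME\x00\x00\x00', encoding)

# A name with a non-ASCII character: 0xD1 is N-with-tilde in latin-1.
accented = decode_field(b'SE\xd1OR\x00\x00', encoding)

print(plain)
print(accented)
```

The field-type byte, by contrast, is compared as-is and never goes through this decoding step.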

--
~Ethan~