"More About Unicode in Python 2 and 3"

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jan 6 12:27:40 EST 2014


Ethan Furman wrote:

> On 01/05/2014 06:37 PM, Dan Stromberg wrote:
>>
>> The argument seems to be "3.x doesn't work the way I'm accustomed to,
>> so I'm not going to use it, and I'm going to shout about it until
>> others agree with me."
> 
> The argument is that a very important, if small, subset of data
> manipulation became very painful in Py3.  Not impossible, and not
> difficult, but painful because the mental model and the contortions
> needed to get things to work don't sync up anymore.  Painful because
> Python is, at heart, a simple and elegant language, but with the
> use-case of embedded ascii in binary data that elegance went right out
> the window.
> 
> On 01/05/2014 06:55 PM, Chris Angelico wrote:
>>
>> It can't be both things. It's either bytes or it's text.
> 
> Of course it can be:
> 
> 0000000: 0372 0106 0000 0000 6100 1d00 0000 0000  .r......a.......
> 0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 0000020: 4e41 4d45 0000 0000 0000 0043 0100 0000  NAME.......C....
> 0000030: 1900 0000 0000 0000 0000 0000 0000 0000  ................
> 0000040: 4147 4500 0000 0000 0000 004e 1a00 0000  AGE........N....
> 0000050: 0300 0000 0000 0000 0000 0000 0000 0000  ................
> 0000060: 0d1a 0a                                  ...
> 
> And there we are, mixed bytes and ascii data.  

Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
Text != "ascii data", and the fact that some people apparently think it
does is pretty much the heart of the problem.

I see no mixed bytes and text. I see bytes. Since the above comes from a
file, it cannot be anything else but bytes. Do you think that a file that
happens to be a JPEG contains pixels? No. It contains bytes which, after
decoding, represent pixels. Same with text, ascii or otherwise.

Now, it is true that some of those bytes happen to fall into the same range
of values as ASCII-encoded text. They may even represent text after
decoding, but since we don't know what the file contents mean, we can't
know that. It might be a mere coincidence that the four bytes starting at
hex offset 40 are the C long 1095189760 (read big-endian), which happens to
look like "AGE" with a null at the end. For historical reasons, your hexdump
utility performs that decoding step for you, which is why you can see "NAME"
and "AGE" in the right-hand block, but that doesn't mean the file contains
text. It contains bytes, some of which represent text after decoding.
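
To make that concrete, here is a minimal Python 3 sketch of reading the same
four bytes two ways. The variable names are mine, and I am assuming a 32-bit
big-endian integer, which is the reading that gives 1095189760; nothing in
the file itself tells you which interpretation is the intended one.

```python
import struct

# The four bytes at hex offset 40 in the dump above.
raw = bytes.fromhex("41474500")

# Reading 1: a big-endian unsigned 32-bit integer (an assumption about
# byte order and width; the bytes themselves do not say).
(as_int,) = struct.unpack(">I", raw)
print(as_int)          # 1095189760

# Reading 2: ASCII-encoded text, trailing NUL included.
as_text = raw.decode("ascii")
print(repr(as_text))   # 'AGE\x00'
```

Both readings are decoding steps applied to the same bytes. Neither one is
what the file "really" contains.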

If you (generic you) don't get that, you'll have a bad time. I mean *really*
get it, deep down in the bone. The long, bad habit of thinking of
ASCII-encoded bytes as text is the problem here. The average programmer has
years and years of experience thinking about decoding bytes to numbers and
back (just not by that name), so it doesn't lead to any cognitive
dissonance to think of hex 4147 4500 as either four bytes, two double-byte
ints, or a single four-byte int. But as soon as "text" comes into the
picture, the average programmer has equally many years of thinking that the
byte 41 "just is" the letter "A", and that's simply *wrong*.
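
A rough Python 3 illustration of both halves of that, again with names of
my own choosing. The numeric views assume big-endian byte order, and cp037
(an EBCDIC code page that ships with Python) is just one convenient example
of an encoding in which byte 41 is not "A".

```python
import struct

raw = bytes.fromhex("41474500")

# Three numeric views of the same four bytes (big-endian assumed).
four_bytes = list(raw)                   # [65, 71, 69, 0]
two_shorts = struct.unpack(">2H", raw)   # (16711, 17664)
one_int = struct.unpack(">I", raw)[0]    # 1095189760

# The byte 41 only becomes a character once you pick an encoding.
byte_41 = b"\x41"
as_ascii = byte_41.decode("ascii")       # 'A'
as_ebcdic = byte_41.decode("cp037")      # some other character, not 'A'

print(four_bytes, two_shorts, one_int)
print(repr(as_ascii), repr(as_ebcdic))
```

Nobody blinks at the first three lines giving three different answers. The
last two are exactly the same kind of thing.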


> As I said earlier, my example is minimal, but still very frustrating in
> that normal operations no longer work.  Incidentally, if you were
> thinking that NAME and AGE were part of the ascii text, you'd be wrong --
> the field names are also encoded, as are the Character and Memo fields.

What Character and Memo fields? Are you trying to say that the NAME and AGE
are *not* actually ASCII text, but a mere coincidence, like my example of
1095189760? Or are you referring to the fact that they're actually encoded
as ASCII? If not, I have no idea what you are trying to say.



-- 
Steven



