"More About Unicode in Python 2 and 3"

Mon Jan 6 19:42:41 EST 2014

Ethan Furman wrote:

> On 01/06/2014 09:27 AM, Steven D'Aprano wrote:
>> Ethan Furman wrote:
>>
>> Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
>> Text != "ascii data", and the fact that some people apparently think it
>> does is pretty much the heart of the problem.
> 
> The heart of a different problem, not this one.  The problem I refer to is
> that many binary formats have well-defined
> ascii-encoded text tidbits.  These tidbits were quite easy to work with in
> Py2, not difficult but not elegant in Py3, and even worse if you have to
> support both 2 and 3.

Many things are more difficult if you have to support a large range of
versions. That's life, for a programmer.

>> Now, it is true that some of those bytes happen to fall into the same
>> range of values as ASCII-encoded text. They may even represent text after
>> decoding, but since we don't know what the file contents mean, we can't
>> know that.
> 
> Of course we can -- we're the programmer, after all.  This is not a random
> bunch of bytes but a well defined format for storing data.

No, you misunderstand me. *You* may know what the data represents, but *we*
don't, because you just drop a hex dump in our laps with no explanation.

>> It might be a mere coincidence that the four bytes starting at
>> hex offset 40 are the C long 1095189760 which happens to look like "AGE"
>> with a null at the end. For historical reasons, your hexdump utility
>> performs that decoding step for you, which is why you can see "NAME"
>> and "AGE" in the right-hand block, but that doesn't mean the file
>> contains text. It contains bytes, some of which represent text after
>> decoding.
> 
> As it happens, 'NAME' and 'AGE' are encoded, and will be decoded.

You're either saying something utterly trivial, or something utterly
profound, and I can't tell which.

Of course they are encoded. The file doesn't contain the letter "N", it
contains the byte 0x4E. So what are you actually trying to say?
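
To make the point concrete, here is a minimal sketch (Python 3) of the same
four bytes read two ways. The big-endian byte order is my assumption; the
hexdump alone doesn't tell you which reading is the intended one.

```python
import struct

raw = b'AGE\x00'  # the four bytes from the hexdump example

# Reading 1: a big-endian signed 32-bit integer (byte order is an assumption).
as_long = struct.unpack('>l', raw)[0]

# Reading 2: NUL-padded ASCII text.
as_text = raw.rstrip(b'\x00').decode('ascii')

print(as_long)  # 1095189760
print(as_text)  # AGE
```

Neither reading is "in" the file. Both are interpretations applied to the
same bytes.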

> They could just as easily have contained tildes,
> accents, umlauts, and other strange (to me) characters.

I'm especially confused here because tildes are included in the ASCII
character set. Here's one here: ~

> It's actually the 
> 'C' and the 'N' that bug me (like I said, my example is minimal,
> especially compared to a network protocol).
> 
> And you're right -- it is easy to say FIELD_TYPE = slice(15,16), and it
> was also easy to say FIELD_TYPE = 15, but there is a critical difference
> -- can you spot it?
> 
> ..
> ..
> ..
> In case you didn't:  both work in Py2, only the slice version works
> (correctly) in Py3,

I accept that using the slice is inelegant. But lots of things are inelegant
when you do them the wrong way. Treating your textual data as bytes is the
wrong way. You apparently know that your data is encoded text, you
apparently know the encoding... so why don't you just decode it and treat
it as text instead of insisting on dealing with the raw bytes?
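
Something like the following is what I have in mind. It is only a sketch:
the 16-byte record layout, the field offsets and the name parse_field are
all made up for illustration, since I don't know your actual format.

```python
# Hypothetical 16-byte field descriptor (layout invented for illustration):
#   bytes 0-10   field name, ASCII, NUL-padded
#   byte  11     field type code, ASCII ('C', 'N', ...)
#   bytes 12-15  other binary data, ignored here
record = b'NAME'.ljust(11, b'\x00') + b'C' + b'\x00' * 4


def parse_field(record):
    """Decode the textual parts of a descriptor once, at the boundary."""
    name = record[:11].rstrip(b'\x00').decode('ascii')
    type_code = record[11:12].decode('ascii')
    return name, type_code


name, type_code = parse_field(record)
print(name, type_code)  # NAME C
```

After decoding, type_code == 'C' is an ordinary string comparison, with no
slices or ord() calls scattered through the rest of the code.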

Are you worried about performance? I'd be sympathetic if you were writing
some low-level network protocol stuff where performance is vital, but you
keep saying that your application is "minimal", which I interpret as
performance not being critical. So what's the deal?

> but the worst part is why do I
> have to use a slice to take a single byte when a simple index should work?

I don't understand the rationale for having byte indexing return an int
instead of a one-byte substring. Especially since we still have a perfectly
good way to extract the numeric value from a one-byte byte-string:

py> ord(b'N')
78
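
For anyone following along, this sketch shows the Python 3 behaviour under
discussion, using the byte string from Ethan's example:

```python
data = b'\r\n\x12\x08N\x00'

by_index = data[4]    # indexing gives an int in Python 3
by_slice = data[4:5]  # slicing gives a one-byte bytes object

print(by_index)               # 78
print(by_slice)               # b'N'
print(by_index == b'N')       # False: an int never equals a bytes object
print(by_slice == b'N')       # True
print(by_index == ord(b'N'))  # True
```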

> Because the bytes type lies.  It shows, for example, b'\r\n\x12\x08N\x00'
> but when I try to access that N to see if this is a Numeric field I get:
> 
> --> b'\r\n\x12\x08N\x00'[4]
> 78
> 
> This is a cognitive dissonance that one does not expect in Python.

Yes, I agree. I think it was a terrible mistake to have bytes continue to
pretend to be ASCII. Having this occur:

py> print(b'\x4E')
b'N'

does nothing but muddy the water. I think it would be too much to
disallow using ASCII literals in byte strings, but we shouldn't
*display* byte strings as ASCII.

py> print(b'N')  # This would be better.
b'\x4E'
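
You can get that display today with a small helper. This is just a sketch
for Python 3, and hex_repr is a name I made up, not anything in the
standard library:

```python
def hex_repr(data):
    """Return a repr-like string showing every byte as a \\xHH escape."""
    # Iterating over bytes yields ints in Python 3.
    return "b'" + ''.join('\\x%02X' % byte for byte in data) + "'"


print(hex_repr(b'N'))     # b'\x4E'
print(hex_repr(b'\r\n'))  # b'\x0D\x0A'
```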

[...]
> Different problem.  The problem here is that bytes and byte literals don't
> compare equal.

Right! Now I get where you are coming from.

-- 
Steven