"More About Unicode in Python 2 and 3"

Mon Jan 6 11:43:52 EST 2014

Ethan Furman wrote:

> Using my own project [1] as a reference:  good ol' dbf files -- character
> fields, numeric fields, logic fields, time fields, and of course the
> metadata that describes these fields and the dbf as a whole.  The
> character fields I turn into unicode, no sweat.  The metadata fields are
> simple ascii, and in Py2 something like `if header[FIELD_TYPE] == 'C'` did
> the job just fine.  In Py3 that compares an int (67) to the unicode letter
> 'C' and returns False.  

Why haven't you converted the headers to text too? You're using them as if
they were text. They might happen to merely contain the small subset of
Unicode which matches the ASCII encoding, but that in itself is no good
reason to keep them as bytes. If you want to work with stuff as if it were
text, convert it to text.
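
As a rough sketch of what "convert it to text" might look like in Python 3:
decode the header once, then compare against ordinary strings. The four-byte
header and the offset 3 are made up for illustration (they are not the real
dbf layout), and ASCII is assumed because the metadata is described above as
simple ascii.

```python
# Hypothetical header record; the contents and layout are invented.
raw_header = b'abcC'
FIELD_TYPE = 3  # example offset only, not the real dbf offset

# Decode once, at the boundary where the bytes come in.
header = raw_header.decode('ascii')

# From here on the header is text, so plain string comparisons work.
is_character_field = header[FIELD_TYPE] == 'C'
```

After the decode step, the original Py2-style comparison against 'C' behaves
the same way on Python 3.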

If you do have a good reason for keeping them as bytes, say because you need
to do a bunch of bitwise operations on them, it's not that hard to do the job
correctly: instead of defining FIELD_TYPE as 3 (for example), define it as
slice(3,4). Then:

    if header[FIELD_TYPE] == b'C':

will work. For sure, this is a bit of a nuisance, and slightly error-prone,
since Python won't complain if you forget the b prefix; it will silently
return False. Which is the right thing to do, inconvenient though it may be
in this case. But it is workable, with a bit of discipline.
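
To make the difference concrete, here is a small Python 3 sketch of indexing
versus slicing a bytes object. The header contents and the name
FIELD_TYPE_INDEX are invented for the example.

```python
# Hypothetical header record; the contents are invented.
header = b'abcC'

FIELD_TYPE_INDEX = 3       # plain integer index (hypothetical name)
FIELD_TYPE = slice(3, 4)   # one-byte slice, as suggested above

by_index = header[FIELD_TYPE_INDEX]  # an int in Python 3
by_slice = header[FIELD_TYPE]        # a bytes object of length 1

slice_matches = by_slice == b'C'     # compares bytes with bytes
forgot_prefix = by_slice == 'C'      # bytes vs str: never equal
```

Indexing gives the integer 67, slicing gives b'C', and the comparison with the
b prefix left off is simply False rather than an error.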

Or define a helper, and use that:

    def eq(byte, char):
        return byte == ord(char)

    if eq(header[FIELD_TYPE], 'C'):

Worried about the cost of all those function calls, all those ord()'s? I'll
give you the benefit of the doubt and assume that this is not premature
optimisation. So do it yourself:

    C = ord('C')  # Convert it once.
    if header[FIELD_TYPE] == C:  # And use it many times.
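
For what it's worth, a quick sketch showing that the helper and the
precomputed constant give the same answers on Python 3. The records and the
offset are made up; only eq() and C come from the snippets above.

```python
def eq(byte, char):
    # Compare an int taken from a bytes object with a one-character str.
    return byte == ord(char)

C = ord('C')  # Convert it once.

# Hypothetical records; contents and offset are invented.
records = [b'abcC', b'abcN', b'xyzC']
FIELD_TYPE = 3

# Indexing bytes yields ints in Python 3, so both forms compare int to int.
via_helper = [eq(record[FIELD_TYPE], 'C') for record in records]
via_constant = [record[FIELD_TYPE] == C for record in records]
```

Both lists flag the first and third records as character fields; the second
form just skips the function call and the repeated ord().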

[Note to self: when I'm BDFL, encourage much more compile-time
optimisations.]

> For me this is simply a major annoyance, but I 
> only have a handful of places where I have to deal with this.  Dealing
> with protocols where bytes is the norm and embedded ascii is prevalent --
> well, I can easily imagine the nightmare.

Is it one of those nightmares where you're being chased down an endless long
corridor by a small kitten wanting hugs? 'Cos so far I'm not seeing the
terror...

-- 
Steven