"More About Unicode in Python 2 and 3"

Chris Angelico rosuav at gmail.com
Mon Jan 6 10:46:08 EST 2014


On Tue, Jan 7, 2014 at 2:10 AM, Ethan Furman <ethan at stoneleaf.us> wrote:
> On 01/05/2014 06:55 PM, Chris Angelico wrote:
>>
>>
>> It can't be both things. It's either bytes or it's text.
>
>
> Of course it can be:
>
> 0000000: 0372 0106 0000 0000 6100 1d00 0000 0000  .r......a.......
> 0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 0000020: 4e41 4d45 0000 0000 0000 0043 0100 0000  NAME.......C....
> 0000030: 1900 0000 0000 0000 0000 0000 0000 0000  ................
> 0000040: 4147 4500 0000 0000 0000 004e 1a00 0000  AGE........N....
> 0000050: 0300 0000 0000 0000 0000 0000 0000 0000  ................
> 0000060: 0d1a 0a                                  ...
>
> And there we are, mixed bytes and ascii data.  As I said earlier, my example
> is minimal, but still very frustrating in that normal operations no longer
> work.  Incidentally, if you were thinking that NAME and AGE were part of the
> ascii text, you'd be wrong -- the field names are also encoded, as are the
> Character and Memo fields.

That's alternating between encoded text and non-text bytes. Each
individual piece is either text or non-text, not both. The ideal way
to manipulate it would most likely be a single decode operation that
turns this into (probably) a dictionary, handling both the
structure/layout and the UTF-8 text in one step. A less ideal
(but more convenient) solution might involve what's currently
under discussion elsewhere: a (possibly partial) percent-formatting or
.format() method for bytes.
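
As a rough sketch of that single decode step, the following assumes the
quoted dump is a dBase III-style table header (a 32-byte file header,
then 32-byte field descriptors ended by 0x0d). The layout, the function
name decode_header, and the default "ascii" encoding are assumptions
made for illustration; they are not stated anywhere in this thread.

```python
import struct

# Bytes transcribed from the hex dump quoted above.
RAW = bytes.fromhex(
    "0372 0106 0000 0000 6100 1d00 0000 0000"
    "0000 0000 0000 0000 0000 0000 0000 0000"
    "4e41 4d45 0000 0000 0000 0043 0100 0000"
    "1900 0000 0000 0000 0000 0000 0000 0000"
    "4147 4500 0000 0000 0000 004e 1a00 0000"
    "0300 0000 0000 0000 0000 0000 0000 0000"
    "0d1a 0a"
)


def decode_header(raw, encoding="ascii"):
    """Decode structure and text in one pass (assumed dBase III-style layout)."""
    # Non-text part: version, date (YY MM DD), record count, header and
    # record lengths, all little-endian.
    version, yy, mm, dd, nrecords, header_len, record_len = struct.unpack_from(
        "<4BIHH", raw, 0
    )
    fields = []
    offset = 32  # field descriptors start after the 32-byte file header
    while offset < len(raw) and raw[offset] != 0x0D:
        # 11-byte NUL-padded name, 1-byte type code, 4 bytes skipped,
        # then 1-byte field length and 1-byte decimal count.
        name, ftype, length, decimals = struct.unpack_from(
            "<11sc4xBB", raw, offset
        )
        fields.append({
            # Text part: only these pieces are decoded into str.
            "name": name.rstrip(b"\x00").decode(encoding),
            "type": ftype.decode(encoding),
            "length": length,
            "decimals": decimals,
        })
        offset += 32
    return {
        "version": version,
        # Assumption: the year byte counts years since 1900.
        "last_update": (1900 + yy, mm, dd),
        "record_count": nrecords,
        "header_length": header_len,
        "record_length": record_len,
        "fields": fields,
    }


header = decode_header(RAW)
```

Under those assumptions, header["fields"] holds two entries, NAME (type
"C", length 25) and AGE (type "N", length 3), and every value in the
result is either an int or a str. No value is both bytes and text.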

None of this changes the fact that there are bytes used to
store/transmit stuff, and abstract concepts used to manipulate them.
Just like nobody expects to be able to write a dict to a file without
some form of encoding (pickle, JSON, whatever), you shouldn't expect
to write a character string without first turning it into bytes.
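
A minimal Python 3 sketch of that parallel: a dict has to be serialised
and a str has to be encoded before either can go into a binary stream.
The helper names and the sample record are made up for illustration.

```python
import io
import json


def dict_to_bytes(obj):
    """Serialise a dict to JSON text, then encode that text as UTF-8."""
    return json.dumps(obj, sort_keys=True).encode("utf-8")


def bytes_to_dict(data):
    """Reverse of dict_to_bytes: decode UTF-8, then parse the JSON text."""
    return json.loads(data.decode("utf-8"))


def try_write(stream, obj):
    """Return True if the stream accepted obj, False if it raised TypeError."""
    try:
        stream.write(obj)
    except TypeError:
        return False
    return True


record = {"name": "Zo\u00eb", "age": 30}  # made-up example data
buf = io.BytesIO()  # stands in for a file opened in binary mode

wrote_str = try_write(buf, record["name"])                     # str is refused
wrote_bytes = try_write(buf, record["name"].encode("utf-8"))   # bytes are fine
buf.write(b"\n")
buf.write(dict_to_bytes(record))
```

Here wrote_str is False and wrote_bytes is True: the binary stream only
takes bytes, so the encoding step is explicit for the string just as
the JSON step is explicit for the dict.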

ChrisA


