[Python-Dev] Smuggling bytes into text (was Re: RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5)

Steven D'Aprano steve at pearwood.info
Mon Jan 13 03:03:15 CET 2014


Changing the subject line to better describe what we're talking about. I 
hope it is of interest to others apart from Ethan and me -- mixed bytes 
and text is hard to get right. (And if I've got something wrong, I'd 
like to know about it.)


On Sat, Jan 11, 2014 at 08:38:49PM -0800, Ethan Furman wrote:
> On 01/11/2014 06:29 PM, Steven D'Aprano wrote:
[...]
> Since you're talking to me, it would be nice if you addressed the same 
> use-case I was addressing, which is mixed: ascii-encoded text, 
> ascii-encoded numbers, ascii-encoded bools, binary-encoded numbers, and 
> misc-encoded text.

I thought I had addressed it. But since your use-case is underspecified, 
please excuse me if I get some of it wrong.


> And no, your example will not work with any text, it would completely 
> moji-bake my dbf files.

I don't think it will. Admittedly, I don't know all the ins and outs of 
your files, but as far as I can tell, nothing you have said so far 
suggests that my plan will fail.

Code speaks louder than words: http://www.pearwood.info/ethan_demo.py

This code produces a string containing smuggled bytes. There is:

- a header containing raw bytes;

- metadata consisting of the name of some encoding in ASCII;

- A series of tagged fields. Each field has a name, which is always 
  ASCII, and terminated with a colon. It is then followed by a 
  single ASCII character and some data:

  * T for some arbitrary chunk of text, encoded in the metadata 
    encoding, with a length byte prefix (that is, like a Pascal
    string);
  * F for a boolean flag "true" or "false" in ASCII;
  * N for an integer, a C long;
  * D for an integer, in ASCII, terminated at the first non-digit;
  * B for a chunk of arbitrary bytes, with a two-byte length prefix.
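For instance, a T field and a B field might be built something like this 
(a sketch only -- the helper names are mine, not the ones in the linked 
file):

```python
import struct

def make_text_field(name, text, encoding):
    # Hypothetical helper: build a T field as a str, smuggling the
    # encoded payload in via the Latin-1 trick.
    payload = text.encode(encoding)                  # bytes in the metadata encoding
    blob = struct.pack('B', len(payload)) + payload  # Pascal-style length byte
    return name + ":T" + blob.decode('latin-1')

def make_bytes_field(name, blob):
    # B field: two-byte length prefix, then the raw bytes, all smuggled.
    return name + ":B" + (struct.pack('>H', len(blob)) + blob).decode('latin-1')

field = make_text_field("title", u"\u03a9mega", "utf-8")
# Every code point in `field` is below 256, so the whole record can
# later round-trip through .encode('latin-1') without loss.
```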

And the whole thing is written out to a file, then read back in, 
without data corruption or mojibake. I wrote this about 1am this 
morning, so it may or may not be a shining example of idiomatic Python 
code, but it works and is readable.

I understand that this won't match your actual use-case precisely, but I 
hope it contains the same sorts of mixed binary data and ASCII text that 
you're talking about. There are fixed width fields, variable length 
fields, binary fields, ASCII fields, non-ASCII text, and multiple 
encodings, all living in perfect harmony :-)

And it runs unchanged under both Python 2.7 and 3.3.

As so often happens, what seems good in principle is less useful in 
practice. Once I actually started writing code, I quickly moved beyond 
the simple model:

template = "some text"
data = template % ("text", 42, b'\x16foo'.decode('latin-1'))

that I thought would be easy, to a more structured approach. So I wrote 
reader and writer classes and abstracted away the messy bits, although 
in truth none of it is very messy. The worst is dealing with the 2 
versus 3 differences, and even that requires only a handful of small 
helper functions.
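The helpers I mean are of this flavour (illustrative, not the actual ones 
in the linked file):

```python
import sys

if sys.version_info[0] >= 3:
    def byte_values(blob):
        # In Python 3, iterating over bytes yields ints directly.
        return list(blob)
    def one_byte(n):
        return bytes([n])
else:
    def byte_values(blob):
        # In Python 2, iterating a byte string yields 1-char strings.
        return [ord(c) for c in blob]
    def one_byte(n):
        return chr(n)
```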

I don't claim that the code I tossed together is the optimal design, or 
bug-free, or even that the exact same approach will work for your 
specific case. But it is enough to demonstrate that the basic idea is 
sound, you can process mixed text and bytes in a clean way, it doesn't 
generate mojibake, and it can operate in both 2.7 and 3.3 without even 
needing a __future__ import.



> >>>Only the binary blobs need to be decoded. We don't need to encode the
> >>>template to bytes, and the textual data doesn't get encoded until we're
> >>>ready to send it across the wire or write it to disk.
> 
> No!  When I have text, part of which gets ascii-encoded and part of which 
> gets, say, cp1251 encoded, I cannot wait till the end!

I think we are talking about different textual data. It's a bit 
ambiguous, my apologies. You're talking about taking individual fields 
and deciding how to process them. I'm talking about doing your 
processing in the text domain, which means at the end of the process I 
have a Unicode string object rather than a bytes object. Before that str 
can be written to disk, it needs to be encoded.
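In miniature, it looks like this (my own toy record layout, not Ethan's):

```python
# All the processing happens in the text domain: the blob is smuggled
# into the str via Latin-1, so every code point is below 256.
blob = b'\x82\xe1\xc2\x00\x7b\xff'
record = "NAME:%s|DATA:%s|" % ("George", blob.decode('latin-1'))

# Nothing gets encoded until the very end, when we write to disk;
# Latin-1 maps code points 0-255 straight back to bytes 0-255.
raw = record.encode('latin-1')
assert raw == b"NAME:George|DATA:\x82\xe1\xc2\x00\x7b\xff|"
```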


> >>And what if your name field has data not representable in latin-1?
> >>
> >>--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
> >>u'\u0441\u0440\u0403'
> >
> >Where did you get those bytes from? You got them from somewhere.
> 
> For the sake of argument, pretend a user entered them in.
> 
> >Who knows? Who cares? Once you have bytes, you can treat them as a blob of
> >arbitrary bytes and write them to the record using the Latin-1 trick.
> 
> No, I can't.  See above.
>
> > If
> >you're reading those bytes from some stream that gives you bytes, you
> >don't have to care where they came from.
> 
> You're kidding, right?  If I don't know where they came from (a graphics 
> field?  a note field?) how am I going to know how to treat them?

As I understand it, you want the ability to store *arbitrary bytes* in 
the file, right? Here are nine arbitrary bytes:

b'\x82\xE1\xC2\0\0\x7B\0\xFF\xA8'

You don't need to know how I generated them, whether they are sound 
samples, data from a serial port, three RGB values, or some strange C 
struct. I need to know how to generate them, but you can treat them as 
an opaque blob. They're *already* bytes, you're not responsible for 
converting whatever the data was into bytes, because it's already done. 
It's just a blob of bytes as far as you're concerned. All you need to do 
is smuggle them into a text string.
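The smuggling itself is a one-liner, and it is lossless for any bytes at 
all:

```python
blob = b'\x82\xE1\xC2\x00\x00\x7B\x00\xFF\xA8'  # the nine bytes above

smuggled = blob.decode('latin-1')           # a str of nine code points, all < 256
assert smuggled.encode('latin-1') == blob   # perfect round-trip

# Not just these nine -- every possible byte value survives:
everything = bytes(bytearray(range(256)))   # spelled to work on 2.7 and 3.3
assert everything.decode('latin-1').encode('latin-1') == everything
```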


> >But what if you don't start with bytes? If you start with a bunch of
> >floats, you'll probably convert them to bytes using the struct module.
> 
> Yup, and I do.
> 
> >If you start with non-ASCII text, you have to convert them to bytes too.
> >No difference here.
> 
> Really? 

Again, I fear I failed to explain myself in sufficient detail. If your 
non-ASCII text doesn't match the encoding specified, how else are you 
going to include it? See below.


> You just said above that "it will work with any text data" -- you 
> can't have it both ways.

I have been unclear, I apologise. Let me try again with an example.

As the end-user, I get to specify the encoding, that's what you said. 
Okay, I specify ISO-8859-7, which is Greek. Now obviously if I hand you 
a bunch of Russian letters in a string, and you try to encode them using 
ISO-8859-7, you're going to get an exception. That's okay, as presumably 
I'm sensible enough to only include characters which exist in the 
encoding I choose, and if not, it's my own damn fault.

But suppose I have a reason for this strange behaviour. If I pre-encode 
those Russian letters to bytes, using (say) UTF-16, then I can hand you 
the raw bytes to store as a binary blob. Later, I get the binary blob 
back again, and I can decode them using UTF-16, to get the original 
Russian text back again. So long as you don't mangle the binary blob, 
the process is completely reversible.

That is what I am talking about.
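In code, with the codecs spelled out:

```python
russian = u'\u0441\u0440\u0403'   # the text -- not representable in ISO-8859-7

# Pre-encode with a codec that *can* represent it...
blob = russian.encode('utf-16')

# ...smuggle the blob through the text pipeline untouched...
stored = blob.decode('latin-1')
retrieved = stored.encode('latin-1')

# ...and decode it again at the far end: the original text comes back.
assert retrieved.decode('utf-16') == russian
```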


> >You ask the user for their name, they answer "срЃ" which is given to you
> >as a Unicode string, and you want to include it in your data record. The
> >specifications of your file format aren't clear, so I'm going to assume
> >that:
> >
> >1) ASCII text is allowed "as-is" (that is, the name "George" will be
> >    in the final data file as b'George');
> 
> User data is not (typically) where the ASCII data is, but some of the 
> metadata is definitely and always ASCII.  The user text data needs to be 
> encoded using whichever codec is specified by the file, which is only 
> occasionally ASCII.
> 
> 
> >2) any other non-ASCII text will be encoded as some fixed encoding
> >    which we can choose to suit ourselves;
> 
> Well, the user chooses it, we have to abide by their choice.  (It's kept in 
> the file metadata.)
> 
> 
> >3) arbitrary binary data is allowed "as-is" (i.e. byte N has to end up
> >    being written as byte N, for any value of N between 0 and 255).
> 
> In a couple field types, yes.  Usually the binary data is numeric or date 
> related and there is conversion going on there, too, to give me the bytes I 
> need.

The above all sounds reasonable. But the following does not -- I think 
it shows some fundamental confusion on your part.


> [snip]
> 
> >>--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
> >>Traceback (most recent call last):
> >>   File "<stdin>", line 1, in <module>
> >>UnicodeEncodeError: 'latin-1' codec can't encode characters in position
> >>0-2: ordinal not in range(256)
> >
> >That is backwards to what I've shown. Look at my earlier example again:
> 
> And you are not paying attention:
> 
> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
> \--------------------------------------/  \-------------/
>  a non-ascii compatible unicode string      to latin1 bytes

You can't *decode* Unicode strings. Try it in Python 3, and it breaks:

py> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'


For your code to work, you can't be using Python 3, you have to be using 
Python 2, where "..." is already bytes, not Unicode. Since it's a byte 
string, there's no point in decoding it as UTF-8 and then encoding it 
back to bytes. All you are doing is running the risk of a 
UnicodeEncodeError:

# Python 2.7 this time
py> '\xd0\x94'.decode('utf-8').encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0414' in 
position 0: ordinal not in range(256)

Latin-1 does not work with arbitrary *characters*, but it does work with 
arbitrary *bytes*. You're trying to take a UTF-8 encoded byte string, 
decode back to arbitrary Unicode characters, then *encode* to Latin-1, 
which may fail.

What I am doing is taking arbitrary *bytes*, then *decode* to Latin-1 as 
a way of smuggling those bytes into a str.
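The asymmetry is easy to demonstrate:

```python
# Decoding *bytes* as Latin-1 can never fail: each of the 256 byte
# values maps to the code point with the same number.
text = bytes(bytearray(range(256))).decode('latin-1')
assert len(text) == 256 and ord(text[255]) == 255

# Encoding *characters* as Latin-1 fails for anything above U+00FF:
try:
    u'\u0414'.encode('latin-1')   # CYRILLIC CAPITAL LETTER DE
except UnicodeEncodeError:
    pass                          # expected: no Latin-1 byte for it
else:
    raise AssertionError("encoding should have failed")
```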



> ("срЃ".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'), 42, blob.decode('latin-1'))
>       \----------------------------------------------/  \--------------/
>                getting the actual bytes I need            and back into 
>                unicode until I write them later

In Python 3, that works, but I'm not sure if it does what you intend (I 
don't know what you intend). You have encode and decode the right way 
around this time, for Python 3 strings.

In Python 2, the interpreter (wrongly) accepts "срЃ" as a byte-string 
literal, but the results are poorly defined. What you actually get 
(probably) depends on your environment. On my system, I seem to get UTF-8 
encoded bytes, but that's not guaranteed.


> You did say to use a *text* template to manipulate my data, and then write 
> it later, no?  Well, this is what it would look like.

If the text strings the user gives you are compatible with the 
encoding they specify, you don't need that. Just use:

("срЃ", 42, blob.decode('latin-1'))

It's the user's responsibility if they choose to specify an encoding 
which is more restrictive than the contents of some field. If they do 
that, they have to encode that field somehow, so they can treat it as a 
binary blob. *You* don't have to do this, and you certainly don't have 
to take perfectly good text and turn it into bytes then back to text 
just so you can insert it back into text. That would be silly.
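To see why that round trip is a no-op when the text fits the codec 
(using cp1251 here, which covers Cyrillic):

```python
name = u'\u0441\u0440\u0403'   # representable in cp1251

# The contortion: encode with the real codec, smuggle back in via
# Latin-1, then undo both steps again at write time...
contorted = name.encode('cp1251').decode('latin-1')
assert contorted.encode('latin-1').decode('cp1251') == name

# ...which buys nothing: if the text fits the file's codec, just keep
# it as text and encode once, at the end.
assert name.encode('cp1251').decode('cp1251') == name
```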

> >Bytes get DECODED to latin-1, not encoded.
> >
> >Bytes -> text is *decoding*
> >Text -> bytes is *encoding*
> 
> Pretend for a moment I know that, and look at my examples again.

Sorry to be harsh, but based on the way you swapped decode and encode 
around in the examples above, I would have to pretend :-)


> I am demonstrating the contortions needed when my TEXTual data is not 
> ASCII-compatible:  It must be ENcoded using the appropriate codec to BYTES, 
> then DEcoded back to unicode using latin1, all so later I can ENcode the 
> bloomin' unicode data structure back to bytes using latin1 again.  Dizzy 
> yet?

No.

If I, the end user, insist on using a stupid legacy encoding, then *YES* 
absolutely of course I have to jump through hoops to store arbitrary 
Unicode characters using a legacy encoding that only supports a tiny 
subset of Unicode. This should not surprise you.


> And you must know this, because it is what your bytify function does.  Are 
> you trolling?

No.


-- 
Steven

