[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Mon Jan 13 05:12:33 CET 2014

Steven D'Aprano writes:

 > Of course you're right, but I have understood the above as being a 
 > sketch and not real code. (E.g. does "header" really mean the literal 
 > string "header", or does it stand in for something which is a header?) 
 > In real code, one would need to have some way of telling where the 
 > binary image data ends and the Unicode string begins.

Sure, but I think in Ethan's case it's probably out of band.  I have
been assuming out of band.

 > > This corrupts binary_image_data.  Each byte > 127 will be replaced by
 > > two bytes.
 > 
 > And reading it back using decode('utf-8') will replace those two bytes 
 > with a single byte, round-tripping exactly.

True, but I'm assuming Ethan himself didn't choose DBF format.

 > Of course if you encode to UTF-8 and then try to read the binary data as 
 > raw bytes, you'll get corrupted data. But do people expect to do this? 

People?  Real People use Python, they wouldn't do that. :-)  But the
app that forced Ethan to deal with DBF might.
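
To make the failure mode concrete, here is a minimal sketch.  The byte
values and variable names are invented for illustration; they are not
Ethan's code or real DBF data.

```python
# Hypothetical stand-in for binary image data; the values are made up.
binary_image_data = bytes([0x00, 0x7F, 0x80, 0xFF])

# Smuggle the bytes into a str via latin1: one code point per byte.
smuggled = binary_image_data.decode('latin1')

# Writing that str out as UTF-8 turns each byte > 127 into two bytes.
on_disk = smuggled.encode('utf-8')

# A Python reader that reverses both steps gets the original back.
round_tripped = on_disk.decode('utf-8').encode('latin1')

print(len(binary_image_data), len(on_disk))   # 4 6
print(round_tripped == binary_image_data)     # True
# A reader that treats the file as raw bytes sees different data.
print(on_disk == binary_image_data)           # False
```

So the round trip is fine as long as both ends are Python doing the
same dance; anything that reads the raw bytes directly gets the
expanded form.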

 > > This kind of subtlety is precisely why MAL warned about use of latin1
 > > to smuggle bytes.
 > 
 > How would you smuggle a chunk of arbitrary bytes into a text string? 
 > Short of doing something like uuencoding it into ASCII, or
 > equivalent.

Arbitrary bytes as a chunk?  I wouldn't do that, probably (see below),
and it's not possible in Python 3 at present (in a str, ASCII codes
always represent the corresponding ASCII character; they are never
uninterpreted bytes).

But if I know where the bytes are going to be in the str, I'd use
latin1 or (encoding='ascii', errors='surrogateescape') depending on
how well-controlled the processing is.  If I really "own" those bytes,
I might use latin1, and just "forget" all of the string-processing
functions that care about character identity (e.g., case manipulation).
If the bytes might somehow end up leaking into the rest of the
program, I'd use surrogateescape and live with the doubled space usage.
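
A rough sketch of the difference between the two, using a made-up
two-byte value.  Both spellings round-trip, but they fail differently
when the str escapes into code that treats it as text:

```python
raw = bytes([0x41, 0xE9])  # made-up data: 'A' plus one non-ASCII byte

as_latin1 = raw.decode('latin1')
as_escaped = raw.decode('ascii', errors='surrogateescape')

# Both round-trip if you encode with the same codec and error handler.
latin1_ok = as_latin1.encode('latin1') == raw
escaped_ok = as_escaped.encode('ascii', errors='surrogateescape') == raw

# With latin1 the byte 0xE9 looks like a real character, so case
# manipulation silently changes it to 0xC9.
mangled = as_latin1.upper().encode('latin1')

# With surrogateescape the byte becomes a lone surrogate (U+DCE9),
# which a strict encoder refuses, so a leak is noticed, not hidden.
try:
    as_escaped.encode('utf-8')
    leaked_silently = True
except UnicodeEncodeError:
    leaked_silently = False

print(latin1_ok, escaped_ok)   # True True
print(mangled == raw)          # False
print(leaked_silently)         # False
```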

But really, if it's not a wire-to-wire protocol kind of thing, I'd go
ahead and create a proper model for the data, and text would be text,
and chunks of arbitrary bytes would be bytes and integers would be
integers....
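
For what it's worth, a minimal sketch of what such a model might look
like.  The Record class, its field names, and the length-prefixed wire
layout are all hypothetical; this is not DBF or any other real format,
just one way to keep the types apart until the last moment.

```python
import struct
from dataclasses import dataclass

HEADER = '<HII'  # hypothetical layout: title length, payload length, count


@dataclass
class Record:
    title: str      # text stays text
    payload: bytes  # arbitrary bytes stay bytes
    count: int      # integers stay integers

    def to_wire(self, encoding='utf-8'):
        # Only the text field is encoded; the payload is copied untouched.
        title_bytes = self.title.encode(encoding)
        header = struct.pack(HEADER, len(title_bytes),
                             len(self.payload), self.count)
        return header + title_bytes + self.payload

    @classmethod
    def from_wire(cls, data, encoding='utf-8'):
        # The length prefixes say where the text ends and the bytes begin.
        title_len, payload_len, count = struct.unpack_from(HEADER, data)
        start = struct.calcsize(HEADER)
        title = data[start:start + title_len].decode(encoding)
        payload = data[start + title_len:start + title_len + payload_len]
        return cls(title, payload, count)


record = Record(title='caf\xe9', payload=bytes([0x00, 0x80, 0xFF]), count=3)
print(Record.from_wire(record.to_wire()) == record)   # True
```

Here the boundary between text and bytes travels in band as length
prefixes; if it is known out of band, the header can go away and the
rest stays the same.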