[Python-Dev] PEP 460: allowing %d and %f and mojibake

Glenn Linderman v+python at g.nevcal.com
Mon Jan 13 07:46:14 CET 2014


On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:
> Glenn Linderman writes:
>
>   > the proposals to embed binary in Unicode by abusing Latin-1
>   > encoding.
>
> Those aren't "proposals", they are currently feasible techniques in
> Python 3 for *some* use cases.
>
> The question is why infecting Python 3 with the byte/character
> confoundance virus is preferable to such techniques, especially if
> their (serious!) deficiencies are removed by creating a new type such
> as asciistr.
"smuggled binary" (a great term borrowed from a different subthread) 
muddies the waters of what you are dealing with. As long as the actual 
data is only Latin-1 text and smuggled binary, the technique probably 
isn't too bad... you can define the "smuggled binary" as a "decoding" of 
binary to text, rather as base64 "encodes" binary to ASCII. And it can 
be a useful technique.

As soon as you introduce "smuggled non-ASCII, non-Latin-1 text" 
encodings into the mix, it gets thoroughly confusing... just as 
confusing as the Python 2 text model. It takes a decode+encode round 
trip to do the smuggled text, plus another encode to push it to the 
boundary, and you end up with text that you know is text but, because 
of the techniques required for smuggling it, you can't operate on it or 
view it properly as the text it should be.

The "byte/character confoundance virus" is a hobgoblin of paranoid 
perception.  In another post, I pointed out that

b"%d" % 25  is not equivalent to  "%d" % 25  because of the "b" in the 
first case. So the "implicit" encoding that everyone on that side of 
the fence was talking about is not implicit at all, but explicit.  The 
numeric characters produced by %d are clearly in the ASCII subset of 
text, so having b"%d" % 25 produce pre-encoded ASCII text is explicit 
and practical.
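For reference, this is exactly the behavior that was later standardized by PEP 461 and shipped in Python 3.5, so the distinction can be checked directly:

```python
# %d on bytes produces the ASCII digits, pre-encoded (PEP 461, Python 3.5+).
encoded = b"%d" % 25
text = "%d" % 25

assert encoded == b"25"                 # bytes result
assert text == "25"                     # str result
assert encoded != text                  # not equivalent: different types
assert text.encode("ascii") == encoded  # the "encoding" is explicit ASCII
```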

My only concern was what  b"%s" % 'abc'  should do, because in general 
a str may contain more than ASCII  (generalize to  b"%s" % str(...) ).  
Guido solved that one nicely.  Of course, at this point I could punt 
the whole argument off to "Guido said so", but since you asked me, I 
felt it appropriate to respond from my perspective... and I'm not sure 
Guido specifically addressed your smuggled-binary proposal.
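For the record, the resolution that eventually landed in PEP 461 (Python 3.5) sidesteps the ambiguity: %s on bytes accepts only bytes-like arguments, and %a is available when you want an ASCII repr of an arbitrary object. A quick check of that behavior:

```python
# b"%s" refuses str arguments outright rather than guessing an encoding.
try:
    b"%s" % "abc"
except TypeError:
    pass  # expected: no implicit encoding of str
else:
    raise AssertionError("expected TypeError for b'%s' % str")

assert b"%s" % b"abc" == b"abc"       # bytes-like arguments work directly
assert b"%a" % ("abc",) == b"'abc'"   # %a: ascii() of the object, encoded
```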

When the mixture of text and binary is done as encoded text in binary, 
it is obvious that only limited text processing can be performed, and 
getting the text there requires that it be encoded (properly, one 
hopes, per the binary specification being created) to become binary. 
And there are no extra, confusing Latin-1 encode/decode operations 
required.
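A minimal sketch of that discipline (the record layout and names here are hypothetical, just for illustration): text is encoded exactly once, at the point it becomes part of the binary blob, with no Latin-1 smuggling in between.

```python
import struct

# Hypothetical record layout: 2-byte big-endian length prefix + UTF-8 text.
def pack_field(text):
    data = text.encode("utf-8")                 # the one explicit encode
    return struct.pack(">H", len(data)) + data

def unpack_field(blob):
    (length,) = struct.unpack_from(">H", blob)
    return blob[2:2 + length].decode("utf-8")   # the one explicit decode

msg = pack_field("héllo")
assert unpack_field(msg) == "héllo"
```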

From a higher-level perspective, I think it would be great to have a 
module, perhaps called "boundary" (let's call it that for now), that 
allows some definition syntax (augmented BNF? augmented ABNF?) to 
describe the format of a binary blob, and then provides methods for 
generating and parsing it to/from Python objects. Obviously, the ABNF 
couldn't understand Python objects; instead, Python objects might 
define the ABNF to which they correspond, along with methods for 
accepting binary and producing the object (a factory method?) and 
methods for generating the binary.  As objects build upon other 
objects, the ABNF to which they correspond could be composed, and 
perhaps even proven capable of parsing all valid blobs corresponding to 
the specification, and perhaps even proven capable of generating only 
valid blobs (although I'm not a software-proof guru; last I heard there 
were definite limits on the ability to do proofs, but maybe this is a 
limited enough domain that it could work).
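A toy sketch of what such a "boundary" module might feel like, using struct format codes as a stand-in for the ABNF-like definitions (every name here is hypothetical; nothing like this exists in the stdlib):

```python
import struct

# Hypothetical declarative spec: (field name, struct format) pairs standing
# in for the ABNF-like definitions suggested above.
SPEC = [("version", "B"), ("flags", "B"), ("length", ">H")]

def parse(blob, spec=SPEC):
    """Parse a blob into a dict according to the spec."""
    obj, offset = {}, 0
    for name, fmt in spec:
        (value,) = struct.unpack_from(fmt, blob, offset)
        obj[name] = value
        offset += struct.calcsize(fmt)
    return obj

def generate(obj, spec=SPEC):
    """Generate a blob from a dict according to the same spec."""
    return b"".join(struct.pack(fmt, obj[name]) for name, fmt in spec)

blob = generate({"version": 1, "flags": 0, "length": 512})
assert parse(blob) == {"version": 1, "flags": 0, "length": 512}
```

Because one spec drives both directions, round-tripping comes for free, which is the sanity-check property the paragraph above is after.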

Then all blobs could be operated on rather like web browsers operate on 
the DOM, or like some XML parsing libraries, by defining each blob as a 
collection of objects for its pieces. XML is far too wordy for 
practical use (but hey! it is readable), yet perhaps it could be 
practical if tokenized, and then the tokenized representation could be 
converted to a DOM just as XML and HTML are. (This is mostly to draw 
the parallel in the parsing and processing techniques; I'm not 
seriously suggesting a binary version of XML, but there is a strong 
parallel, and it could be done.)  Given a DOM-like structure, a 
validator could be written to operate on it, providing, if not a proof, 
at least a sanity check. And, given the DOM-like structure, one call to 
the top-level object to generate the blob format would walk over all of 
them, generating the whole blob.
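A hypothetical sketch of that DOM-like idea (the class names are invented for illustration): each node knows how to emit its own bytes, so one call at the root walks the whole structure.

```python
# Hypothetical DOM-like tree: leaves hold raw bytes, containers emit
# their children's bytes in order.
class Leaf:
    def __init__(self, payload):
        self.payload = payload

    def to_bytes(self):
        return self.payload

class Node:
    def __init__(self, *children):
        self.children = list(children)

    def to_bytes(self):
        # One call at the top-level object walks the entire tree,
        # generating the whole blob.
        return b"".join(child.to_bytes() for child in self.children)

doc = Node(Leaf(b"\x01"), Node(Leaf(b"ab"), Leaf(b"c")))
assert doc.to_bytes() == b"\x01abc"
```

A validator would be just another tree walk over the same structure, checking each node against its spec instead of emitting bytes.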

Off I go, drifting into Python ideas.... but I have a program I want to 
rewrite that could surely use some of these techniques (and probably 
will), because it wants to read several legacy formats, and produce 
several legacy formats, as well as a new, more comprehensive format.  So 
the objects will be required to parse/generate 4 different blob 
structures, one of which has its own set of several legacy variations.