[Python-Dev] PEP 461: Adding % formatting to bytes and bytearray -- Final, Take 2

Wed Feb 26 04:57:14 CET 2014

Nick Coghlan writes that b'%a' is

 > the obvious way to interpolate representations of arbitrary objects
 > into binary formats that contain ASCII compatible segments.

The only argument that I have sympathy for is

 > %a *should* be allowed for consistency with text interpolation

although introduction of a new format character is a poor man's
consistency, and this is consistency for consistency's sake.  (I don't
have a big problem with that, though.  I *like* consistency!)

But TOOWTDI is where I get off the bus.  I don't agree that this
consistency is terribly useful, given how easy it is to

    def ascify(obj):
        # You could also do this with UTF-8.
        return ascii(obj).encode('ascii', errors='backslashreplace')
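
For comparison, here is a minimal sketch that checks the helper against
the proposed b'%a'.  It assumes an interpreter where PEP 461 has been
implemented (Python 3.5 and later, as far as I know); the sample values
are made up for illustration.

```python
def ascify(obj):
    # ascii() already escapes every non-ASCII character, so the encode
    # step cannot fail; the error handler is only a safety net.
    return ascii(obj).encode('ascii', errors='backslashreplace')

# Illustrative values only: a non-ASCII str, a float, a nested list, None.
samples = ['caf\u00e9', 3.5, [1, '\u65e5\u672c'], None]

for obj in samples:
    # Wrap obj in a 1-tuple so tuples and mappings are not unpacked.
    assert b'%a' % (obj,) == ascify(obj)
    print(ascify(obj))
```

If the two spellings really are interchangeable like this, the format
character buys consistency and nothing else.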

I think the obvious way to interpolate representations of arbitrary
objects into binary formats that may contain ASCII-compatible
*segments* is a '__bytes__' method.  (Yes, I'm cheating, that's not
the sense of "arbitrary" Nick meant.  But see below for what happens
when I *do* consider Nick's sense of "arbitrary".)  If it makes sense
to represent an object using only ASCII bytes (eg, a BASE64 encoding
for binary blobs), why not a '__bytes__' method?  If non-ASCII-
compatible segments are allowed, why not use __repr__, or a '__bytes__'
method that gives you a full representation of the object (eg, a pickle)?
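
A sketch of the BASE64 case, under assumptions: 'Blob' is a hypothetical
class invented for this example, and the '%b' line assumes a Python
where PEP 461 is implemented (3.5 or later), in which '%b' consults
'__bytes__'.

```python
import base64

class Blob:
    """Hypothetical binary payload whose wire form is pure ASCII."""

    def __init__(self, data):
        self.data = data

    def __bytes__(self):
        # BASE64 output uses only ASCII bytes, so it is safe to embed in
        # an ASCII-compatible segment of a binary format.
        return base64.b64encode(self.data)

blob = Blob(b'\x00\xff')
wire = bytes(blob)                   # calls Blob.__bytes__
framed = b'payload=%b;' % (blob,)    # %b also uses __bytes__ (3.5+)
print(wire, framed)
```

The object decides its own byte representation; the format string does
not need to know anything about it.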

So we're really talking about formats that are 100% ASCII-compatible.
What are the use cases?  Debugging logs?  I don't see it.  As far as
human-readability goes, I read 100% incompatible-with-anything debug
logs (aka, containing Japanese in several of its 4 commonly-used
wire-format encodings) into XEmacs buffers regularly with no problems.
Decoding them can be a bitch, of course -- life would be simple if
only they *were* Python reprs!  Of course Emacsen provide a huge
amount of help with such things, but most of what I need to do would
work fine as long as the editor doesn't crash, has an ASCII printable
visual representation of non-printing-ASCII bytes, and allows both
truncation and wrap-at-screen-edge printing of long lines.

OTOH, maybe you have an automatic log-analysis tool or the like that
snafus on non-ASCII-compatible stuff.  If you are truly serious about
keeping your debug logs 100% ASCII-compatible (whether pure ASCII or
some ASCII-compatible "universal" encoding like UTF-8 or GB18030), you
really have your work cut out for you, especially if you want it to be
automatically parseable.  Ascification is the least of your worries.
Or you can do something like

    def log_debug_msg(msg_or_obj):
        write_to_log(ascify(msg_or_obj))

and get rid of the annoying "b" prefix on all your log message
formats, too!  YMMV, but *I* don't see debug logs as a plausible
justification.

The only plausible case I can think of is Glenn's web app where you
actually directly insert debug information into wire protocol destined
to appear in end-user output -- but then, this web app itself is only
usable in Kansas and other places where the nearest place that a
language other than Middle American English is spoken is a megameter
away.  Industrial strength frameworks will do that work using str, and
then .encode() to the user's requested encoding.  So this probably
isn't an app, but rather the web server itself (which speaks bytes to
clients, not text to users).  But then, typical reprs (whether
restricted to ASCII or not) have insufficient information about an
object to reproduce it.  Why is it a good idea to encourage people
writing objects to a debug log to use a broken-for-the-purpose repr?
(I can see it could go either way.  For example, if the alternative is
a "something went wrong" error message.  But I'd like to see a stronger
argument that a feature which is intended to encourage people to take
shortcuts -- and otherwise has no justification -- is Pythonic. :-)

The "inappropriate '__bytes__' method" seems to be an imaginary
bogeyman, in any case.  If people really want to dump arbitrary
objects (in Nick's sense) to a byte-oriented stream *outside* of the
stream's protocol, I think it would be easier to do that with 'ascify'
than by altering *every* class definition by adding 'ascify' as the
'__bytes__' definition.  Note that 'ascify' is fully general in case
you don't know what the type of the object you are dumping is;
'__bytes__' may not be.  Some objects may have existing incompatible
definitions for '__bytes__': eg, in HTTP, there's no problem with
sending an object in binary format, and it might very well be a
complex object with internal structure that gets flattened for
transmission (into a pickle, for example, or a Python list of frames
from a streaming server: "stream % frames[secs_to_frame(28):]").
Surely you're not going to replace that '__bytes__' with

    def __bytes__(self):
        return ascify(self)
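
To make the contrast concrete, a rough sketch: 'Frames' is a
hypothetical class standing in for the streaming example, with pickle
chosen arbitrarily as its wire format.  The existing '__bytes__' stays
as it is, and the external helper handles the debug dump.

```python
import pickle

def ascify(obj):
    return ascii(obj).encode('ascii', errors='backslashreplace')

class Frames:
    """Hypothetical stream segment; __bytes__ is its wire format."""

    def __init__(self, frames):
        self.frames = list(frames)

    def __repr__(self):
        return 'Frames(%r)' % (self.frames,)

    def __bytes__(self):
        # Binary, not ASCII-compatible, and that is fine for the protocol.
        return pickle.dumps(self.frames)

seg = Frames([b'\x89PNG', 'caf\u00e9'])
wire = bytes(seg)     # protocol payload, untouched by the debug path
debug = ascify(seg)   # ASCII-only dump; works for any object with a repr
print(debug)
```

Both uses coexist without touching the class definition, which is the
point: the debug dump does not need to hijack '__bytes__'.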

Steve