[Python-Dev] PEP 460 reboot

Guido van Rossum guido at python.org
Mon Jan 13 19:58:24 CET 2014


Let me try rebooting the reboot.

My interpretation of Nick's argument is that he is asking for a bytes
formatting language that doesn't have an implicit ASCII assumption.

To me this feels absurd. The formatting codes (%s, %c) themselves are
expressed as ASCII characters. If you include anything else in the
format string besides formatting codes (e.g. b'<%s>'), you are giving
it as ASCII characters. I don't know what characters the EBCDIC codes
37, 99 or 115 encode (these are the ASCII codes for '%', 'c', 's') but
it certainly wouldn't be safe to use % when the LHS is EBCDIC-encoded.
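To make the point about the codes concrete, here is a minimal sketch. It
assumes cp037 as the EBCDIC variant, only because that code page ships
with CPython; the argument does not depend on which variant is used.

```python
# The three codes mentioned above, read as ASCII.
codes = bytes([37, 99, 115])
as_ascii = codes.decode("ascii")

# The same three characters encoded with cp037 (an assumed, illustrative
# choice of EBCDIC code page).
as_ebcdic = "%cs".encode("cp037")

print(as_ascii)                      # the characters % formatting looks for
print(list(as_ebcdic))               # different byte values in EBCDIC
print(repr(codes.decode("cp037")))   # what 37, 99, 115 mean in cp037
```

Because the byte values differ, formatting code that scans for byte 37
would not find the '%' in an EBCDIC-encoded format string.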

If I had some byte strings in an unknown encoding (but the same
encoding for all) that I needed to concatenate I would never think of
'%s%s' % (x, y) -- I would write x+y. (Even in Python 2.)

If I see some code using *any* formatting operation (regardless of
whether it's %d, %r, %s or %c) I am going to assume that there is some
ASCII-ness, and if there isn't, the code's author has obscured their
goal to me.

I hear the objections against b'%s' % 'x' returning b"'x'" loud and
clear, and if the noise about that sub-issue is preventing folks from
seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
use %b which would require its argument to be bytes. Those bytes
should still probably be ASCII-ish, but there's no way to test that.
That's fine with me and should be fine with Nick as well -- PEP 460
doesn't check that your encodings match (how could it? :-), nor does
plain string concatenation using +.
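A sketch of how the %b compromise behaves, assuming Python 3.5 or later,
where bytes formatting was eventually added (PEP 461). It is an
illustration of the idea, not the proposal text itself.

```python
# %b accepts bytes and inserts them verbatim.
ok = b"<%b>" % b"x"

# Nothing checks that the argument is ASCII-ish; arbitrary bytes pass.
unchecked = b"<%b>" % b"\xff\xfe"

# A str argument is rejected instead of being turned into its repr.
try:
    b"<%b>" % "x"
    rejected = False
except TypeError:
    rejected = True

print(ok, unchecked, rejected)
```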

In my head I make the following classification of situations where you
work with bytes and/or text.

(A) Pure binary formats (e.g. most IP-level packet formats, media
files, .pyc files, tar/zip files, compressed data, etc.). These are
handled using the struct module (e.g. tar/zip) and/or custom C
extensions (e.g. gzip).
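For case (A), a minimal sketch using the struct module. The header
layout here (type, flags, length) is hypothetical and chosen only for
illustration; it is not taken from any real packet format.

```python
import struct

# Hypothetical fixed-size header: 2-byte type, 2-byte flags, 4-byte length,
# all unsigned, in network (big-endian) byte order.
HEADER = struct.Struct("!HHI")

packed = HEADER.pack(1, 0x8000, 1024)
msg_type, flags, length = HEADER.unpack(packed)

print(packed)                   # raw bytes; no text or encoding involved
print(msg_type, flags, length)
```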

(B) Encoded text. Here you should just decode everything into str
objects and parse your text at that level. If you really want to
manipulate the data as bytes (e.g. because you have a lot of data to
process and very light processing) you may be able to do it, but
unless it's a verbatim copy, you are probably going to make
assumptions about the encoding. You are also probably going to mess up
for some encodings (e.g. leave BOM turds in the middle of a file).
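The BOM problem in case (B) can be reproduced with a small sketch. It
assumes two chunks that were each written with the utf-8-sig codec, which
prepends a BOM to every chunk.

```python
# Two chunks, each saved separately with a leading UTF-8 BOM.
chunk1 = "first\n".encode("utf-8-sig")
chunk2 = "second\n".encode("utf-8-sig")

# Manipulating at the bytes level: the second BOM ends up mid-stream.
joined_bytes = (chunk1 + chunk2).decode("utf-8-sig")

# Decoding each chunk to str first, then working at the text level.
joined_text = chunk1.decode("utf-8-sig") + chunk2.decode("utf-8-sig")

print(repr(joined_bytes))
print(repr(joined_text))
```

The decoder strips only the leading BOM, so the bytes-level join leaves a
stray U+FEFF in the middle of the text.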

(C) Loosely text-based protocols and formats that have an ASCII
assumption in the spec. Most classic Internet protocols (FTP, SMTP,
HTTP, IRC, etc.) fall in this category; I expect there are also plenty
of file formats using similar conventions (e.g. mailbox files). These
protocols and formats often require text-ish manipulations, e.g. for
case-insensitive headers or commands, or to split things at
whitespace. This is where I find uses for the current ASCII-assuming
bytes operations (e.g. b.lower(), b.split(), but also int(b)) and
where the lack of number formatting (especially %d and %x) is most
painful. I see no benefit in forcing the programmer writing such
protocol-handling code to use more cumbersome ways of converting
between numbers and bytes, nor in forcing them to insert an
encoding/decoding layer -- these protocols often switch between text
and binary data at line boundaries, so the most basic part of parsing
(splitting the input into lines) must still happen in the realm of
bytes.
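A minimal sketch of case (C), using made-up header lines in the style of
HTTP. The %d line assumes Python 3.5 or later, where bytes formatting is
available; the concatenation above it is the more cumbersome alternative.

```python
# Hypothetical header block; parsing stays entirely in the realm of bytes.
raw = b"Content-Length: 42\r\nX-Mode: BINARY\r\n"

headers = {}
for line in raw.split(b"\r\n"):
    if not line:
        continue
    name, _, value = line.partition(b":")
    # Header names are case-insensitive, so normalize with bytes.lower().
    headers[name.strip().lower()] = value.strip()

# int() accepts ASCII digits given as bytes.
length = int(headers[b"content-length"])

# Without number formatting on bytes: round-trip through str.
cumbersome = b"Content-Length: " + str(length).encode("ascii") + b"\r\n"

# With %d on bytes (assumes Python 3.5+).
formatted = b"Content-Length: %d\r\n" % length

print(headers, length, formatted)
```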

IMO PEP 460 and the mindset that goes with it don't apply to any of
these three cases.

Also, IMO requiring a new type to handle (C) seems to add too
much complexity, and adds to porting efforts. I may have felt
differently in the past, but ATM I feel that if newer versions of
Python 3 make porting of Python 2 code easier, through minor
compromises, that's a *good* thing. (Example: adding u"..." literals
to 3.3.)

-- 
--Guido van Rossum (python.org/~guido)

