[Python-Dev] PEP 460: allowing %d and %f and mojibake

Mon Jan 13 11:48:50 CET 2014

Ethan Furman writes:

 > The part that you don't seem to acknowledge (sorry if I missed it)
 > is that there are str-like methods already on bytes.

I haven't expressed myself well, but I don't much care about that.
It's what Knuth would classify as a seminumerical method.  What I do
care about is that the methods that convert other types to text
(including format) not work for bytes.  That's where I consider text
to "start".

 > > is *exactly* the Python 2 model of text.  But you deny that the
 > > effect of your proposals (eg, b"%d" % (12,)) is to reintroduce Python
 > > 2's bytes/character confusion, don't you?
 > 
 > Given that the default (and only) text type in Py3 is str, which is
 > unicode, I don't think any confusion will be as severe, but I
 > acknowledge that there could be some.

I fear it will be quite severe where I live, in Shift JIS/GB18030
land.  (The two most obnoxious encodings known to man, except perhaps
the syntax of Brainf!ck.)
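
A minimal sketch of one way ASCII-oriented bytes operations can misfire on Shift JIS data. The sample characters are illustrative choices, not taken from the thread. Trail bytes of double-byte characters may fall in the ASCII range, so a method that treats bytes as ASCII text can match or modify half of a character:

```python
# Katakana "so" and "tsu" encoded as Shift JIS; each becomes two bytes.
so = 'ソ'.encode('shift_jis')    # b'\x83\\'
tsu = 'ツ'.encode('shift_jis')   # b'\x83c'

# The trail byte of "so" is 0x5C, the ASCII backslash, so a bytes-level
# search for a backslash "finds" one inside the character.
found_backslash = b'\\' in so

# The trail byte of "tsu" is 0x63 (ASCII 'c').  bytes.upper() rewrites it
# to 0x43 ('C'), which decodes as a different katakana.
upper_tsu = tsu.upper().decode('shift_jis')
```

Nothing in either operation raises an error; the damage only shows up when the bytes are decoded later.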

 > >> *My* definition is not ambiguous at all.  If this particular part
 > >> of the byte stream is defined to contain ASCII-encoded text, then I
 > >> can use the bytes text methods to work with it.
 > >
 > > But how is Python supposed to know that?
 > 
 > Python doesn't need to.

... because you know it.  But the ideal of object-oriented programming
(and duck-typing) is that you shouldn't need to; the object should
know how to produce appropriate behavior itself.

 > > But under your definition, you need to make the decision, or
 > > explicitly code the decision, on the basis of context.
 > 
 > Exactly so.  I even have to do that in Py2.

"Even."  This is exactly where PBP and EIBTI part company, I think.
EIBTI thinks it's a bad idea to pass around bytes that are implicitly
some other type, and Python 3 *should be good enough to make that
unnecessary*.  I'm convinced, and Nick is convinced, that we can make
that true for 90% of the cases that it isn't now, if we could just
figure out what's hard about the use cases where Python 3 isn't up to
snuff yet (and figure out which use cases we need to handle to get us
up to 90%!)

PBP doesn't think it's a great idea to pass around bytes that are
implicitly some other type, but didn't mind it (or got used to it) in
Python 2, and so they're not looking at that as a problem that Python
3 can solve.  They're looking at Python 3 as the problem that prevents
them from doing what worked fine in Python 2.  I understand that point
of view, I just think we should be able to do better in Python 3, and
should give it a serious try before giving in.  Remember, "Special
cases aren't special enough to break the rules" comes *before*
"Although practicality beats purity".  Not to forget that "Explicit is
better than implicit" is second[1] on the list. ;-)

After looking at this thread, I feel that (due to misunderstandings on
both sides) purity hasn't really been tried yet.

 > >> If that particular configuration of bytes is because it's
 > >> ASCII-encoded text, then sure.
 > >
 > > Once again, you are advocating precisely the Python 2 model of text.
 > 
 > Not exactly, because what I get back is bytes, which cannot
 > directly be mixed with unicode (str) as it was in Py2.  I think
 > this is a key difference.

You're in good company there; that was Guido's rationale for not
worrying, too.  I agree it's "key" (and I'm sure Nick will, on
reflection if not already).  But I worry (a lot) that it's not enough.
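
The "key difference" can be sketched as follows (variable names are illustrative). In Python 3 the str-like methods on bytes return bytes, and combining bytes with str is an error rather than an implicit conversion:

```python
ascii_bytes = b'ethan'
text = 'some text'

# str-like methods on bytes return bytes, not str.
shouted = ascii_bytes.upper()          # b'ETHAN'

try:
    combined = shouted + text          # bytes + str is rejected
except TypeError:
    combined = None

# Crossing over requires an explicit decode (or encode).
explicit = shouted.decode('ascii') + ' ' + text
print(shouted, combined, explicit)
```

This shows only that mixing is caught; it does not show that the bytes were ASCII text to begin with, which is the part that remains implicit.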

 > This confuses me somewhat.  It's okay to use b'ethan'.upper(),
 > which only makes semantic sense as ASCII-encoded text,

Not really OK.  In theory it doesn't matter, because it doesn't require
serialization or encoding of a primitive type.  In practice, without
powerful formatting, it isn't even a major attraction for treating
bytes as text.  With powerful formatting, it adds to that attraction.

Note that regex doesn't require type conversions (matches have methods
to return positions in the target or subsequences of the target, not
values of other types), which is why I (and I suspect Nick for the
same reason) am comfortable with polymorphic regex but not with bytes
formatting.
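
A small sketch of the regex point, using a made-up header line as input. A bytes pattern applied to a bytes target yields only positions and bytes subsequences; converting to another type is a separate step the caller writes out:

```python
import re

data = b'Content-Length: 42\r\n'   # hypothetical input
m = re.search(rb'\d+', data)

span = m.span()        # positions in the target
digits = m.group()     # a subsequence of the target, still bytes

# Producing an int is an explicit, separate conversion.
length = int(digits.decode('ascii'))
print(span, digits, length)
```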

 > (Aside, I'm perfectly comfortable with "ASCII-encoded text" because
 > if you took u'ethan'.encode('ascii') you would get b'ethan'.  If it
 > was some other encoding, such as cp1251, I would call that
 > particular byte stream "cp1251-encoded text".

Even though "ethan" is perfectly good ASCII-encoded text (as well as
the integer 435,744,694,638 on a big-endian machine with 5-byte words),
and you have no way of knowing whether it was user data (CP1251), a
metadata keyword (ASCII), or the US national debt in 1967 dollars
(integer) when b'ethan' shows up in a trace?
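
The ambiguity is easy to reproduce. In this sketch the same five bytes are read three ways, and nothing in the bytes object says which reading was intended:

```python
raw = b'ethan'

as_ascii = raw.decode('ascii')            # 'ethan'
as_cp1251 = raw.decode('cp1251')          # also 'ethan': every byte is < 128
as_integer = int.from_bytes(raw, 'big')   # big-endian 5-byte integer

print(as_ascii, as_cp1251, f'{as_integer:,}')
```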

 > And if there were methods that worked directly on a cp1251-encoded
 > byte stream I would not have any problem using them on
 > cp1251-encoded text.)

I was afraid of that: all of those methods (except the case methods[2])
will work fine on cp1251-encoded text.  And because they only know
that the string is bytes, the case methods will silently corrupt your
"text" as soon as they get a chance.  That bothers me, even if it
doesn't bother you.  Purity again, if you like.  (But you'd take a
safe .upper if you got it for free, no?)
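
One way to see the problem with the case methods, as a sketch with an arbitrary sample string. bytes.upper() maps only the ASCII letters, so on cp1251 data it produces a half-converted result without any error:

```python
text = 'привет abc'
encoded = text.encode('cp1251')

# bytes.upper() touches only ASCII a-z; the Cyrillic bytes (all > 127)
# pass through unchanged.
bytes_result = encoded.upper().decode('cp1251')   # 'привет ABC'

# str.upper() knows the characters and uppercases all of them.
str_result = text.upper()                         # 'ПРИВЕТ ABC'
```

Whether that counts as corruption or merely a wrong answer, the caller gets no signal that anything was skipped.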

 > Okay, I've thought somewhat.  Under the definition above would it
 > be fair to say that Db3Table (a class in my dbf module) is a
 > boundary type?  It sits between the actual file and the program,
 > and transforms bytes into actual Python types.

Yes, I'd call that a boundary type.
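
A rough sketch of what a boundary type does. The class, field layout, and names below are hypothetical and are not the dbf module's API: raw bytes go in on one side, and real Python types (str, int) come out on the other, with the encoding stated once at the boundary.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    count: int


class FixedWidthTable:
    """Hypothetical boundary type: raw bytes in, Python types out."""

    NAME_WIDTH = 10    # assumed layout: 10 bytes of space-padded text
    COUNT_WIDTH = 5    # followed by 5 ASCII digits

    def __init__(self, text_encoding):
        self.text_encoding = text_encoding

    def parse(self, raw: bytes) -> Record:
        name_field = raw[:self.NAME_WIDTH]
        count_field = raw[self.NAME_WIDTH:self.NAME_WIDTH + self.COUNT_WIDTH]
        return Record(
            name=name_field.decode(self.text_encoding).rstrip(),
            count=int(count_field.decode('ascii')),
        )


table = FixedWidthTable('cp1251')
raw = 'Иван'.encode('cp1251').ljust(10) + b'00042'
record = table.parse(raw)
```

Code on the program side of the boundary then works with record.name as str and record.count as int, and never needs to know which bytes were text.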

Footnotes: 
[1]  Yes, I know what's number 1, but I'm not going to mention it out
loud!

[2]  Arguably those too, since bytes don't have a locale.  They're in
C locale and the bytes >127 don't have semantics like case.