[Python-Dev] PEP 460: allowing %d and %f and mojibake

Sun Jan 12 22:28:15 CET 2014

On 01/12/2014 12:02 PM, Stephen J. Turnbull wrote:
> Georg Brandl writes:
>> Antoine writes:
>>>
>>> . . . if it weren't for your stupid maximalist opposition. . .
>>
>> Can you please stop throwing personal insults around?  You don't have to
>> resort to that level.
>
> Ethan's posts (as an example of one general trend in this thread) are
> pretty frustrating, you have to admit.

Two points:

1) Are you saying it's okay to be insulting when frustrated?  I also find this mega-thread frustrating, but I'm trying 
very hard not to be insulting.

2) If you are going to use my name, please be certain of the facts [1].  More below.

> MAL posted straight out the Python 2 model of text makes it easier for
> him to write some programs, so he's all for reintroducing it.  And
> that is the whole truth of the matter.  Although I disagree with him,
> I appreciate his honesty.

If you have an example of me lying (even if it's just a possibility), please refer to it directly so I can either try to 
explain the misunderstanding or apologize.

> But people keep posting "we don't want Python 2's confounding of text
> and binary, we just want bytes with (nearly) all the functionality of
> strings [because they are (partially|really) encoded text]".  Some of
> them actually use the literal word "text" in their justification!

In only one case did I use the word "text" loosely, and that was when I claimed that Py2 had three text types, and Py3 
had two.  I was wrong, I apologize.  Py3 has one definite text type, str, and, I claim, one half text type in bytes, 
because bytes itself provides ASCII text processing methods.  If you have a better term for the notion of 
b'ethan'.title() --> b'Ethan' than ASCII-text processing, I'll use that instead.  If there are good reasons to not allow 
further concessions to the ASCII-ness of bytes (and you provide a good one below) then that makes living with the 
handicap easier.  But don't lie to me (as Nick tried to) and say that "In particular, the bytes type is, and always will 
be, designed for pure binary manipulation" when it has methods like .center().

If I am wrong, and that was not a lie, please explain it to me.
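
For concreteness, here is a minimal sketch of the ASCII-oriented bytes methods in question. The .title() and .center() calls are the ones named above; .upper() and the non-ASCII sample value are added purely for illustration.

```python
# ASCII-oriented methods that Python 3 bytes objects provide.
name = b'ethan'

titled = name.title()             # b'Ethan'
centered = name.center(9, b'*')   # b'**ethan**'; the fill must be a single byte

# Only bytes in the ASCII letter range are changed; 0xE9 passes through as-is.
mixed = b'caf\xe9'.upper()        # b'CAF\xe9'

# Every one of these methods returns bytes, never str.
results_are_bytes = all(isinstance(v, bytes) for v in (titled, centered, mixed))
```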

> That's, well, what would you call it?  Either they know what they're
> saying, in which case it's disingenuous at best, or they don't know
> what they're saying, in which case it's a proposal based on a clear
> misunderstanding of the situation.

I think some of the misunderstanding (which you also seem to suffer from) is that we (or at least I) /ever/ want a
unicode string back from bytes interpolation.  I don't!  If I start with bytes, I want bytes back!  And I have a very
clear grasp of the difference between str and bytes and what ASCII encoding means; it was a hard and painful lesson for
me and I'm not likely to forget it.

To summarize, I used the term text when referring to unicode text (str), and ASCII or ASCII-encoded text when referring
to bytes that are to be used in a place that requires ASCII bytes for communication (such as content length or field
type).  I do /not/ use ASCII to refer to any ol' collection of bytes that happens to look like it might be ASCII-encoded
text.
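
A minimal sketch of that distinction, using content length as the example. The helper name content_length_header and the sample payload are hypothetical, made up for illustration; the point is only that the header is ASCII-encoded text inside a byte stream while the payload is plain binary data, and that everything stays bytes throughout.

```python
def content_length_header(body):
    """Hypothetical helper: build an ASCII header line for a bytes payload."""
    # The decimal length is ASCII-encoded text, so it is encoded explicitly.
    length = str(len(body)).encode('ascii')
    return b'Content-Length: ' + length + b'\r\n'


payload = b'\x89PNG\r\n\x1a\n'    # arbitrary binary data, not text
header = content_length_header(payload)

# Bytes in, bytes out: no str ever reaches the wire.
message = header + b'\r\n' + payload
```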

> The problem is not going to go
> away just because they *say* they don't want to reintroduce Python 2
> text processing.  That is precisely what this proposal is *intended*
> to do, whether in the limited form proposed by Antoine or in the much
> more extensive form that folks like Ethan want.
>
> What "maximalists" mean is that they promise not to abuse Python 2
> text processing when writing Python 3 programs.  This promise is
> highly unlikely to be kept for two reasons.  First, they can't make
> that promise on behalf of third parties, who for various reasons
> certainly will abuse these features to avoid the encoded-text-to-
> Unicode-text and vice-versa conversions.

I concede that this is a good reason to not allow % interpolation.  Kinda like not allowing sum on strings.

And I don't make promises for other people, and abusing this feature would be a bug.

> Second, I doubt they
> themselves will keep the promise to my satisfaction because their
> definition of "text" is ambiguous.

*My* definition is not ambiguous at all.  If this particular part of the byte stream is defined to contain ASCII-encoded 
text, then I can use the bytes text methods to work with it.  The only time I would return a bytes object is if it was 
supposed to be bytes (an image, for example); otherwise I return a bool, an int, a float, a date, or, even, a str.
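
As a rough sketch of that workflow: the record layout below (count, date, and name separated by b'|') and the function name parse_record are assumptions invented for this example, not any real format. Each field is defined to hold ASCII-encoded text, so it is decoded and converted to a proper Python type before being returned.

```python
from datetime import date, datetime


def parse_record(raw):
    """Hypothetical parser for b'<count>|<YYYY-MM-DD>|<name>' records."""
    count, day, name = raw.split(b'|')
    return (
        int(count.decode('ascii')),                                  # int
        datetime.strptime(day.decode('ascii'), '%Y-%m-%d').date(),   # date
        name.decode('ascii'),                                        # str
    )


count, day, name = parse_record(b'0042|2014-01-12|ethan')
```

A field defined to hold non-ASCII data would raise UnicodeDecodeError here rather than being silently treated as text.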

> When it's convenient for them to
> use text-processing operations on bytes, they'll say "oh, yes, these
> are conventionally considered text-processing features, but that's
> just an accident of the particular configuration of bytes -- yup,
> bytes -- I'm processing."

If that particular configuration of bytes is because it's ASCII-encoded text, then sure.  To use, for example,
bytes.upper on data that wasn't ASCII-encoded text (even if it happened to look like it was) would be the height of
stupidity.  Please don't include me in such accusations.

> But Nick's important example of web frameworks demonstrates the
> problem: unless they convert to text where appropriate, they're just
> pushing the problem off on application writers.  Sometimes passing on
> data as bytes is appropriate, of course, but the framework authors are
> likely to be biased in favor of doing that, and it's not hard to
> imagine frameworks ported from Python 2 passing on the problem
> wholesale on the grounds that "we returned str in Python 2 which is
> bytes in Python 3, and since we were processing bytes the whole time,
> we see no reason to change the 'ABI'."  Of course the application
> writers thought they were receiving text "in an inconvenient and
> ambiguous form".  IMO, with the proposed changes, that is likely to
> continue indefinitely, negating some of the gains I expected to
> receive from Python 3. :-(

This would be a good reason to reject PEP 460, if that danger were deemed more likely than the good it would bring.

> Note: there are a lot of high-level frameworks like Django that even
> in Python 2 basically went to Unicode everywhere internally.  I don't
> deny that.  I think that Python 3 as currently constituted makes it a
> lot easier to make an appropriate decision of where to convert, and
> should take some of the burden off the high-level frameworks.
> Approving this PEP, especially in a maximalist form, will blur the
> lines.

I understand your point, but I disagree.  When I open a file (in binary mode, obviously, as otherwise I'd get massive
corruption) I get back a bunch of bytes.  When working with TCP, I get back a bunch of bytes.  bytes are /already/ the
boundary type.  If we have to make a third type for proper boundary processing, it's an admission that bytes failed in
its role.

--
~Ethan~

[1] I double-checked all my posts on this topic both here and on Python Ideas to make sure.