[Python-Dev] PEP 460: allowing %d and %f and mojibake

Guido van Rossum guido at python.org
Tue Jan 14 19:16:17 CET 2014


[Other readers: asciistr is at https://github.com/jeamland/asciicompat]

On Mon, Jan 13, 2014 at 11:44 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Right, asciistr is designed for a specific kind of hybrid API where
> you want to accept binary input (and produce binary output) *and* you
> want to accept text input (and produce text output). Porting those
> from Python 2 to Python 3 is painful not because of any limitations of
> the str or bytes API but because it's the only use case I have found
> where I actually *missed* the implicit interoperability offered by the
> Python 2 str type.

Yes, the use case is clear.

> It's not an implementation style I would consider appropriate for the
> standard library - we need to code very defensively in order to aid
> debugging in arbitrary contexts, so I consider having an API like
> urllib.parse demand 7-bit ASCII in the binary version, and require
> text to handle impure input to be a better design choice.

This surprises me. I think asciistr should strive to be useful for the
stdlib as well.

> However, in an environment where you can place greater preconditions
> on your inputs (such as "ensure all input data is ASCII compatible")

That gives me the Python 2 willies. :-(

> and you're willing to tolerate the occasional obscure traceback for
> particular kinds of errors,

Really? Can you give an example where the traceback using asciistr()
would be more obscure than using the technique you used in
urllib.parse?

> then it should be a convenient way to use
> common constants (like separators or URL scheme names) in an algorithm
> that can manipulate either binary or text, but not a combination of
> the two (the latter is still a nice improvement in correctness over
> Python 2, which allowed them to be mixed freely rather than requiring
> consistency across the inputs).

Unfortunately I suspect there are still examples where asciistr's
"submissive" behavior can produce surprises. E.g. consider a function
of two arguments that must either be both bytes or both str. It's
easily conceivable that for certain combinations of incorrect
arguments (i.e. one bytes and one str) the function doesn't raise an
error but returns something of one or the other type. (And this is
exactly the Python 2 outcome we're trying to avoid.)
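
A sketch of what I mean (asciicompat is the assumed import path for
the package, and render_pair is a made-up example):

    from asciicompat import asciistr  # assumed import path

    EQ = asciistr('=')

    def render_pair(key, value):
        # Intended contract: key and value are both bytes, or both str.
        if not value:
            return key + EQ
        return key + EQ + value

    render_pair('a', 'b')    # -> 'a=b', as intended
    render_pair(b'a', b'b')  # -> b'a=b', as intended
    render_pair(b'a', '')    # mixed types, but no error: -> b'a='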

> It's still slightly different from Python 2, though. In Python 2, the
> interaction model was:
>
>     str & str -> str
>     str & unicode -> unicode
>
> (with the one exception being str.format: that consistently produces
> str rather than promoting to Unicode)

Or raises good old UnicodeError. :-(
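
For anyone who has repressed the memory, a Python 2 session showing
both the promotion and the failure mode:

    >>> 'a' + u'b'            # str & unicode -> unicode
    u'ab'
    >>> 'caf\xc3\xa9' + u'!'  # implicit ASCII decode of a non-ASCII str
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3:
    ordinal not in range(128)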

> My goal for asciistr is that it should exhibit the following behaviour:
>
>     str & asciistr -> str
>     asciistr & asciistr -> str (making it asciistr would be a pain and
> I don't have a use case for that)

I almost had one in the example code I sent in response to Greg.

>     bytes & asciistr -> bytes

I understand that '&' here stands for "any arbitrary combination", but
what about searches? Given that asciistr's base class is str, won't it
still blow up if you try to use it as an argument to e.g.
bytes.startswith()? Equality tests also sound problematic; is b'x' ==
asciistr('x') == 'x' ???
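
Concretely, I'd want to know what each of these does (same assumed
import path; the comments are my guesses, not verified behavior):

    from asciicompat import asciistr  # assumed import path

    sep = asciistr(':')
    'x:y'.startswith(sep)     # presumably fine: asciistr subclasses str
    b'x:y'.startswith(sep)    # TypeError? or accepted via the buffer API?
    b'x' == asciistr('x')     # bytes.__eq__ sees a str subclass -- False?
    asciistr('x') == 'x'      # presumably True, via str.__eq__
    hash(asciistr('x')) == hash('x')  # required if both may be dict keys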

> So in code like that in urllib.parse (but in a more constrained
> context), you could just switch all your constants to asciistr, change
> your indexing operations to length 1 slices and then in theory
> essentially the same code that worked in Python 2 should also work in
> Python 3.

The more I think about this, the less I believe it's that easy. I
suspect you had the right idea when you mentioned singledispatch. It
might be easier to write the bytes version in terms of the str version
wrapped in decode/encode, or vice versa, rather than trying to reason
out all the different combinations of str, bytes, and asciistr.
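
(Aside: the length-1 slices are needed because indexing bytes changed
meaning in Python 3: b'abc'[0] is 97, while b'abc'[0:1] is b'a'.)
Roughly what I have in mind for the decode/encode approach --
split_scheme is a made-up example, not a proposal for urllib.parse:

    from functools import singledispatch  # new in Python 3.4

    @singledispatch
    def split_scheme(url):
        raise TypeError('expected str or bytes, not %s' % type(url).__name__)

    @split_scheme.register(str)
    def _(url):
        scheme, _, rest = url.partition(':')
        return scheme, rest

    @split_scheme.register(bytes)
    def _(url):
        # The bytes version is just the str version wrapped in
        # decode/encode; non-ASCII input fails loudly here rather than
        # turning into mojibake further down.
        scheme, rest = split_scheme(url.decode('ascii'))
        return scheme.encode('ascii'), rest.encode('ascii')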

> However, Benno is finding that my warning about possible
> interoperability issues was accurate - we have various places where we
> do PyUnicode_Check() rather than PyUnicode_CheckExact(), which means
> we don't always notice a PEP 3118 buffer interface if it is provided
> by a str subclass.

Not sure I understand this, but I believe him when he says this won't be easy.
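
If I follow, the failure pattern is the moral equivalent of this, with
the isinstance() check standing in for PyUnicode_Check():

    def handle_text(s):
        return 'text: ' + s

    def handle_binary(view):
        return b'binary: ' + view.tobytes()

    def dispatch(obj):
        # Any str subclass takes the first branch, so an asciistr is
        # never asked for its PEP 3118 buffer even when the caller
        # wanted the binary path.
        if isinstance(obj, str):
            return handle_text(obj)
        return handle_binary(memoryview(obj))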

> We'll look at those as we find them, and either
> work around them (if we can), decide not to support that behaviour in
> asciistr, or else I'll create a patch to resolve the interoperability
> issue.
>
> It's not necessarily a type I'd recommend using in production code, as
> there *will* always be a more explicit alternative that doesn't rely
> on a tricksy C extension type that only works in CPython. However,
> it's a type I think is worth having implemented and available on PyPI,
> even if it's just to disprove the claim that you *can't* write that
> kind of code in Python 3.

Hm. It is beginning to sound more and more flawed. I also worry that
it will bring back the nightmare of data-dependent UnicodeErrors.
E.g. this (from tests/basic.py):

    def test_asciistr_will_not_accept_codepoints_above_127(self):
        self.assertRaises(ValueError, asciistr, 'Schrödinger')

looks reasonable enough when you assume asciistr() is always used with
a literal as argument -- but I suspect that plenty of people would
misunderstand its purpose and write asciistr(s) as a "clever" way to
turn a string into something that's compatible with both bytes and
strings... :-(
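
That is, something like this (normalize is a made-up example of the
misuse I'm worried about):

    from asciicompat import asciistr  # assumed import path

    def normalize(s):
        # Passes every test written with ASCII data, then dies with
        # ValueError in production the first time someone feeds it
        # 'Schrödinger' -- a data-dependent failure, i.e. the Python 2
        # nightmare all over again.
        return asciistr(s)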

-- 
--Guido van Rossum (python.org/~guido)

