[Python-Dev] PEP 460: allowing %d and %f and mojibake

Tue Jan 14 22:37:00 CET 2014

On 15 Jan 2014 04:16, "Guido van Rossum" <guido at python.org> wrote:
>
> [Other readers: asciistr is at https://github.com/jeamland/asciicompat]
>
> On Mon, Jan 13, 2014 at 11:44 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> > Right, asciistr is designed for a specific kind of hybrid API where
> > you want to accept binary input (and produce binary output) *and* you
> > want to accept text input (and produce text output). Porting those
> > from Python 2 to Python 3 is painful not because of any limitations of
> > the str or bytes API but because it's the only use case I have found
> > where I actually *missed* the implicit interoperability offered by the
> > Python 2 str type.
>
> Yes, the use case is clear.
>
> > It's not an implementation style I would consider appropriate for the
> > standard library - we need to code very defensively in order to aid
> > debugging in arbitrary contexts, so I consider having an API like
> > urllib.parse demand 7-bit ASCII in the binary version, and require
> > text to handle impure input to be a better design choice.
>
> This surprises me. I think asciistr should strive to be useful for the
> stdlib as well.

The concerns you raise are the reason I'm not sure that's possible - just
as in the Python 2 text model, I suspect actually *using* asciistr will
trade ease of development against robust detection of input errors.

I'm OK with that in a PyPI module, but I'd be dubious about including it in
the standard library, and making it a builtin is right out.

> > However, in an environment where you can place greater preconditions
> > on your inputs (such as "ensure all input data is ASCII compatible")
>
> That gives me the Python 2 willies. :-(

Yep - from a formal correctness point of view, asciistr is a terrible idea.
That's not the only consideration in coding though, or we'd all be using
statically typed languages :)

> > and you're willing to tolerate the occasional obscure traceback for
> > particular kinds of errors,
>
> Really? Can you give an example where the traceback using asciistr()
> would be more obscure than using the technique you used in
> urllib.parse?

In urllib.parse I do an up front check that everything is consistently
bytes or str. With asciistr it becomes tempting to skip that up front
check, so you instead get a TypeError about not being able to add str and
bytes.

Technically you could keep that up front check and only use asciistr as an
internal implementation detail, but at that point you may as well do things
properly and write the algorithm to operate solely on bytes or str and
convert the other inputs appropriately (which is the actual approach we use
in the standard library).
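
As a rough illustration of the "up front check, then work in a single
domain" approach described above, here is a minimal sketch. It is not the
urllib.parse code; the names _check_consistent and join_path are
hypothetical and exist only for this example.

```python
def _check_consistent(*args):
    """Return True if every argument is str, False if every one is bytes.

    Raises TypeError up front for unsupported types or a str/bytes mix,
    rather than letting a later concatenation fail obscurely.
    """
    is_text = isinstance(args[0], str)
    for arg in args:
        if not isinstance(arg, (str, bytes)):
            raise TypeError(f"expected str or bytes, got {type(arg).__name__}")
        if isinstance(arg, str) != is_text:
            raise TypeError("cannot mix str and bytes arguments")
    return is_text


def join_path(base, name):
    """Hypothetical hybrid API: join two segments with '/'."""
    if _check_consistent(base, name):
        return base + "/" + name
    # Binary input: decode as strict ASCII, run the text algorithm,
    # then encode the result back to bytes.
    return join_path(base.decode("ascii"), name.decode("ascii")).encode("ascii")
```

With this shape, a mixed call such as join_path("a", b"b") fails in the
check itself, so the error message names the real problem.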

> > then it should be a convenient way to use
> > common constants (like separators or URL scheme names) in an algorithm
> > that can manipulate either binary or text, but not a combination of
> > the two (the latter is still a nice improvement in correctness over
> > Python 2, which allowed them to be mixed freely rather than requiring
> > consistency across the inputs).
>
> Unfortunately I suspect there are still examples where asciistr's
> "submissive" behavior can produce surprises. E.g. consider a function
> of two arguments that must either be both bytes or both str. It's
> easily conceivable that for certain combinations of incorrect
> arguments (i.e. one bytes and one str) the function doesn't raise an
> error but returns something of one or the other type. (And this is
> exactly the Python 2 outcome we're trying to avoid.)

Yep - that's why I consider asciistr to be firmly in the "power tool"
category. If you know what you're doing, it should let you write hybrid API
code that is just as concise as Python 2, but it's also far more error
prone than the core Python 3 text model.

I admit that's a key part of my motivation in trying to help Benno to
create it - I want to show that it's not that you *can't* write code that
way in Python 3, it's that there are good reasons why you *shouldn't*.

And in cases where those reasons don't apply... well, the aim in that case
is "pip install asciicompat" and away you go :)

> > It's still slightly different from Python 2, though. In Python 2, the
> > interaction model was:
> >
> >     str & str -> str
> >     str & unicode -> unicode
> >
> > (with the one exception being str.format: that consistently produces
> > str rather than promoting to Unicode)
>
> Or raises good old UnicodeError. :-(

Unless Benno fixed it in the last couple of days (which seems unlikely
given the complexity of the problem), asciistr currently has the Python 3
behaviour of interpolating the bytes repr() into the string rather than
trying to decode it. That's a key reason why it likely *won't* be a
substitute for PEP 460.
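
For readers unfamiliar with the behaviour in question, this is what plain
Python 3 str does when bytes are interpolated into it (a small sketch using
only builtins, and assuming the interpreter is not run with -b/-bb, which
turns this into a warning or error):

```python
payload = b"abc"

# Interpolating bytes into str uses the bytes repr(); nothing is decoded.
via_percent = "value: %s" % payload
via_format = "value: {}".format(payload)

# Getting the text content requires an explicit decode step.
via_decode = "value: %s" % payload.decode("ascii")

print(via_percent)  # value: b'abc'
print(via_decode)   # value: abc
```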

> > My goal for asciistr is that it should exhibit the following behaviour:
> >
> >     str & asciistr -> str
> >     asciistr & asciistr -> str (making it asciistr would be a pain and
> > I don't have a use case for that)
>
> I almost had one in the example code I sent in response to Greg.
>
> >     bytes & asciistr -> bytes
>
> I understand that '&' here stands for "any arbitrary combination", but
> what about searches? Given that asciistr's base class is str, won't it
> still blow up if you try to use it as an argument to e.g.
> bytes.startswith()? Equality tests also sound problematic; is b'x' ==
> asciistr('x') == 'x' ???

Yes, the aim is to take advantage of the fact that bytes generally
interoperates with anything that publishes a PEP 3118 buffer - the key
feature of asciistr is that it publishes the 8-bit segment from PEP 393 as
that buffer (the constructor checks that the max code point is 127 or less).
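
The buffer-based interoperability that this relies on can be seen with
ordinary builtin buffer exporters. The sketch below does not use asciistr
at all: bytearray and memoryview stand in for "something that publishes a
PEP 3118 buffer", and is_ascii_only is a hypothetical helper that mirrors
the constructor check in pure Python.

```python
data = b"http://example.com"

# bytes methods accept other bytes-like (buffer exporting) objects.
starts = data.startswith(bytearray(b"http"))

# Concatenation also goes through the buffer protocol; the result is bytes.
joined = b"GET " + memoryview(b"/index")

# A plain str does not export a buffer, so bytes refuses it.
try:
    data.startswith("http")
    rejected = False
except TypeError:
    rejected = True


def is_ascii_only(text):
    """Pure Python analogue of the 'max code point is 127 or less' check."""
    return all(ord(ch) <= 127 for ch in text)
```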

It's very CPython specific due to the tinkering with str internals, but the
idea is mostly to show that the semantics of such a type *can* still be
expressed relatively sensibly in Python 3, it's just not an approach that's
going to be applicable very often (most Python 3 native code will be able
to choose to be a binary or text API, so the need for this kind of hybrid
API design mostly affects APIs that started life in Python 2 and hence
still need to support both use cases).

> > So in code like that in urllib.parse (but in a more constrained
> > context), you could just switch all your constants to asciistr, change
> > your indexing operations to length 1 slices and then in theory
> > essentially the same code that worked in Python 2 should also work in
> > Python 3.
>
> The more I think about this, the less I believe it's that easy. I
> suspect you had the right idea when you mentioned singledispatch. It
> might be easier to write the bytes version in terms of the string
> versions wrapped in decode/encode, or vice versa, rather than trying
> to reason out all the different combinations of str, bytes, asciistr.

Yes - while I don't plan to *actually* switch the way urllib.parse works
away from the current higher order function approach (it ain't broke, so
there's nothing to fix), I do have a patch in progress that shows how it
would look using single dispatch instead.

Once I have that done, I'll post it somewhere as a demonstration and update
my binary protocol essay to suggest the additional option of using single
dispatch to process in the binary or text domain, with optional encoding
and decoding steps controlled by the type of the first input.
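
To give a feel for the single dispatch option, here is a minimal sketch. It
is not the patch mentioned above, and split_scheme is a hypothetical
function invented for the example; only functools.singledispatch itself is
real. The bytes implementation is written in terms of the str one, wrapped
in strict ASCII decode/encode steps.

```python
from functools import singledispatch


@singledispatch
def split_scheme(url):
    """Split 'scheme:rest' into (scheme, rest); dispatches on type(url)."""
    raise TypeError(f"expected str or bytes, got {type(url).__name__}")


@split_scheme.register(str)
def _(url):
    # Text domain: this is the only place the algorithm is written out.
    scheme, sep, rest = url.partition(":")
    if not sep:
        return "", url
    return scheme, rest


@split_scheme.register(bytes)
def _(url):
    # Binary domain: decode, reuse the text implementation, encode back.
    scheme, rest = split_scheme(url.decode("ascii"))
    return scheme.encode("ascii"), rest.encode("ascii")
```

The type of the first argument selects the domain, so a str input yields str
results and a bytes input yields bytes results, with no mixing possible.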

Also: after converting a function that takes a tuple where I wanted to
dispatch on the type of the first element, I suspect supporting a
"key=lambda args, kwds: type(args[0][0])" argument to singledispatch in
Python 3.5 might be a reasonable idea. On the other hand, I haven't
explored the possibility of a custom decorator yet, either, so we don't
need to do anything hasty :)
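
For what it's worth, the custom decorator route might look something like
the sketch below. This is purely illustrative: dispatch_on_first_element and
join_parts are hypothetical names, the lookup is by exact type (no MRO walk
as in singledispatch), and an empty sequence will simply raise IndexError.

```python
import functools


def dispatch_on_first_element(func):
    """Dispatch on type(items[0]), where items is the first positional arg."""
    registry = {}

    @functools.wraps(func)
    def wrapper(items, *args, **kwds):
        # Exact-type lookup; fall back to the decorated default function.
        impl = registry.get(type(items[0]), func)
        return impl(items, *args, **kwds)

    def register(cls):
        def decorator(impl):
            registry[cls] = impl
            return impl
        return decorator

    wrapper.register = register
    return wrapper


@dispatch_on_first_element
def join_parts(parts):
    raise TypeError(f"unsupported element type: {type(parts[0]).__name__}")


@join_parts.register(str)
def _join_text(parts):
    return "/".join(parts)


@join_parts.register(bytes)
def _join_binary(parts):
    return b"/".join(parts)
```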

> > However, Benno is finding that my warning about possible
> > interoperability issues was accurate - we have various places where we
> > do PyUnicode_Check() rather than PyUnicode_CheckExact(), which means
> > we don't always notice a PEP 3118 buffer interface if it is provided
> > by a str subclass.
>
> Not sure I understand this, but I believe him when he says this won't be
> easy.

Essentially, we *want* bytes to see asciistr as a buffer exporter, but in a
few places it goes "ah, a str subclass!" instead (which usually isn't what
we want).

> > We'll look at those as we find them, and either
> > work around them (if we can), decide not to support that behaviour in
> > asciistr, or else I'll create a patch to resolve the interoperability
> > issue.
> >
> > It's not necessarily a type I'd recommend using in production code, as
> > there *will* always be a more explicit alternative that doesn't rely
> > on a tricksy C extension type that only works in CPython. However,
> > it's a type I think is worth having implemented and available on PyPI,
> > even if it's just to disprove the claim that you *can't* write that
> > kind of code in Python 3.
>
> Hm. It is beginning to sound more and more flawed. I also worry that
> it will bring back the nightmare of data-dependent UnicodeError.
> E.g. this (from tests/basic.py):
>
>     def test_asciistr_will_not_accept_codepoints_above_127(self):
>         self.assertRaises(ValueError, asciistr, 'Schrödinger')
>
> looks reasonable enough when you assume asciistr() is always used with
> a literal as argument -- but I suspect that plenty of people would
> misunderstand its purpose and write asciistr(s) as a "clever" way to
> turn a string into something that's compatible with both bytes and
> strings... :-(

Yep - while I do plan to publish it on PyPI (with a big "actually using
this type may eat your data if you're not careful" warning), I'm also open
to the idea of just leaving it as a proof of concept on GitHub.

I don't see a lot of actual risk in publishing it though, and I think the
demonstrable risks encountered when attempting to use it do a reasonable
job of showing *why* we changed away from having a core 8-bit string type
that behaved that way.

Cheers,
Nick.

>
> --
> --Guido van Rossum (python.org/~guido)