[Python-Dev] PEP 460: allowing %d and %f and mojibake

Tue Jan 14 08:44:27 CET 2014

On 14 January 2014 16:04, Guido van Rossum <guido at python.org> wrote:
> On Mon, Jan 13, 2014 at 9:34 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> I've now looked at asciistr. (Thanks Glenn and Ethan for the link.)
>
> Now that I (hopefully) understand it, I'm worried that a text
> processing algorithm that uses asciistr might, under hard-to-predict
> circumstances (such as when the arguments contain nothing of interest
> to the algorithm), return an asciistr instance instead of a str
> or bytes instance, and this might confuse a caller (e.g. isinstance()
> checks might fail, dict lookups, or whatever -- it feels like the
> problem is similar to creating the perfect proxy type).

Right, asciistr is designed for a specific kind of hybrid API where
you want to accept binary input (and produce binary output) *and* you
want to accept text input (and produce text output). Porting those
from Python 2 to Python 3 is painful not because of any limitations of
the str or bytes API but because it's the only use case I have found
where I actually *missed* the implicit interoperability offered by the
Python 2 str type.

It's not an implementation style I would consider appropriate for the
standard library - we need to code very defensively in order to aid
debugging in arbitrary contexts, so I consider it a better design
choice for an API like urllib.parse to demand 7-bit ASCII in the
binary version and to require the text version for impure input.
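
A minimal sketch of that defensive behaviour, assuming the urllib.parse
shipped with current CPython 3 (the example URLs are made up for
illustration):

```python
from urllib.parse import urlsplit, urljoin

# Text in, text out: non-ASCII characters are fine in the str version.
text_parts = urlsplit("http://example.com/caf\u00e9")

# Binary in, binary out: accepted as long as the bytes are pure ASCII.
binary_parts = urlsplit(b"http://example.com/path")

# Non-ASCII bytes are rejected rather than guessed at.
try:
    urlsplit(b"http://example.com/caf\xc3\xa9")
    rejected_impure_bytes = False
except UnicodeDecodeError:
    rejected_impure_bytes = True

# Mixing the two domains in a single call is rejected as well.
try:
    urljoin("http://example.com/", b"path")
    rejected_mixed_input = False
except TypeError:
    rejected_mixed_input = True
```

The binary version fails early with a clear exception instead of
producing mojibake further down the line.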

However, in an environment where you can place greater preconditions
on your inputs (such as "ensure all input data is ASCII compatible")
and you're willing to tolerate the occasional obscure traceback for
particular kinds of errors, it should be a convenient way to use
common constants (like separators or URL scheme names) in an algorithm
that can manipulate either binary or text, but not a combination of
the two. That restriction is still a nice improvement in correctness
over Python 2, which allowed them to be mixed freely rather than
requiring consistency across the inputs.

It's still slightly different from Python 2, though. In Python 2, the
interaction model was:

    str & str -> str
    str & unicode -> unicode

(with the one exception being str.format: that consistently produces
str rather than promoting to Unicode)

My goal for asciistr is that it should exhibit the following behaviour:

    str & asciistr -> str
    asciistr & asciistr -> str (making it asciistr would be a pain and
        I don't have a use case for that)
    bytes & asciistr -> bytes

So in code like that in urllib.parse (but in a more constrained
context), you could just switch all your constants to asciistr, change
your indexing operations to length 1 slices and then in theory
essentially the same code that worked in Python 2 should also work in
Python 3.
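
As a rough pure-Python sketch of that interaction table, covering
concatenation only: AsciiConst, SCHEME_SEP and add_scheme are
hypothetical names invented for this illustration, and this is not how
the real asciistr (a C extension type) is implemented.

```python
class AsciiConst(str):
    """Hypothetical stand-in for the table above; NOT the real asciistr.

    Only '+' is handled. Comparison, joining, searching and formatting
    are left untouched and still behave like plain str.
    """

    def __new__(cls, value):
        value.encode("ascii")  # reject non-ASCII constants up front
        return super().__new__(cls, value)

    def __add__(self, other):
        if isinstance(other, bytes):
            return self.encode("ascii") + other  # with bytes -> bytes
        if isinstance(other, str):
            return str.__add__(self, other)      # with str -> plain str
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, bytes):
            return other + self.encode("ascii")  # bytes on the left -> bytes
        if isinstance(other, str):
            return str.__add__(other, self)      # str on the left -> plain str
        return NotImplemented


SCHEME_SEP = AsciiConst("://")


def add_scheme(scheme, rest):
    """Join scheme and rest; both must be str, or both must be bytes."""
    return scheme + SCHEME_SEP + rest
```

With this sketch, add_scheme("http", "example.com") gives text and
add_scheme(b"http", b"example.com") gives bytes, while passing one of
each still raises TypeError.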

However, Benno is finding that my warning about possible
interoperability issues was accurate - we have various places where we
do PyUnicode_Check() rather than PyUnicode_CheckExact(), which means
we don't always notice a PEP 3118 buffer interface if it is provided
by a str subclass. We'll look at those as we find them, and either
work around them (if we can), decide not to support that behaviour in
asciistr, or else I'll create a patch to resolve the interoperability
issue.

It's not necessarily a type I'd recommend using in production code, as
there *will* always be a more explicit alternative that doesn't rely
on a tricksy C extension type that only works in CPython. However,
it's a type I think is worth having implemented and available on PyPI,
even if it's just to disprove the claim that you *can't* write that
kind of code in Python 3.
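
For comparison, one form the more explicit alternative could take is
selecting the constants based on the input type. This is a minimal
sketch, and strip_leading_slash is a hypothetical helper, not an
existing API:

```python
def strip_leading_slash(path):
    """Hypothetical helper that accepts either str or bytes."""
    if isinstance(path, bytes):
        slash = b"/"
    elif isinstance(path, str):
        slash = "/"
    else:
        raise TypeError("expected str or bytes, got %s" % type(path).__name__)
    # Use a length 1 slice: for bytes, path[0] would be an int, whereas
    # path[:1] has the same type as path (and is safe on empty input).
    if path[:1] == slash:
        return path[1:]
    return path
```

The cost is that every constant has to be spelled twice, which is the
duplication a type like asciistr is meant to avoid.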

>> PEP 460 should actually make asciistr easier in the long run, as I now
>> expect we'll run into some "interesting" issues getting formatting to
>> produce anything other than text (contrary to what I said elsewhere in
>> these threads - I hadn't thought through the full implications at the
>> time).
>
> For example?

asciistr is a str subclass, so its formatting methods currently
operate in the text domain and produce str output. Getting it to do
otherwise is actually a task on the scale of implementing ASCII
interpolation operations on the native bytes type.
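
The same effect can be seen with any str subclass. In this small
sketch, TaggedStr is a hypothetical placeholder class, not asciistr
itself:

```python
class TaggedStr(str):
    """Hypothetical str subclass; it adds no behaviour of its own."""


template = TaggedStr("Content-Length: %d")

# Both formatting styles stay in the text domain and return plain str.
percent_result = template % 42
format_result = TaggedStr("{}: {}").format("Host", "example.com")

# In the pure ASCII case the text result can simply be encoded afterwards.
wire_bytes = percent_result.encode("ascii")
```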

This realisation was the *other* factor that made me more comfortable
with the idea of adding ASCII interpolation to the core bytes type - I
previously thought asciistr could easily handle it, but it doesn't
(except in the pure ASCII case where it could theoretically just
encode at the end), thus also knocking out my "we can easily do this
in an extension type, there's no need to provide it in the builtins"
argument.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia