[Python-Dev] PEP 460 reboot

Tue Jan 14 20:23:40 CET 2014

On Tue, Jan 14, 2014 at 1:52 PM, Guido van Rossum <guido at python.org> wrote:
> On Tue, Jan 14, 2014 at 9:45 AM, Chris Barker <chris.barker at noaa.gov> wrote:
>> On Tue, Jan 14, 2014 at 9:29 AM, Yury Selivanov <yselivanov.ml at gmail.com>
>> wrote:
>>>
>>>  - Try str(), and do ".encode('ascii', 'strict')" on the result.
>>
>>
>> please no -- that's the source of a lot of pain in py2 now.
>>
>> having a failure as a result of the value, rather than the type, of an
>> object just makes for hard-to-test bugs. Everything will be hunky dory for
>> development and testing, then in deployment some idiot ( ;-) ) will pass in
>> some non-ascii compatible string and you get a failure. And the person who
>> gets the failure doesn't understand why, or they wouldn't have passed in
>> non-ascii values in the first place...
>>
>> Ease of porting is nice, but let's not make it easy to port bug-prone code.
>
> Right. This is a big red flag to me as well.
>
> I think there is some inherent conflict between the extensible design
> of str.format() and the practical needs of people who are actually
> going to use formatting operations (either % or .format()) with bytes.
>
> The *practical* needs are mostly limited to supporting basic number
> formatting (decimal, hex, padding) and interpolation of anything that
> supports the buffer interface. It would also be nice if you didn't
> have to specify the type at all in the format string, i.e. {} should
> do the right thing for numbers and (all sorts of) bytes.
>
> But the way to arrive at this behavior without duplicating a whole lot
> of code seems to be to call the existing text-based __format__ API and
> convert the result to bytes -- for numbers this should be safe (their
> formatting produces just ASCII digits and a selected few other ASCII
> characters) but leads to an undesirable outcome for other types -- not
> just str but also e.g. lists or dicts containing str instances, since
> those call __repr__ on the contained items, and repr() may produce
> non-ASCII characters.
>
> This is why my earlier proposal used ascii(), which is a "nerfed"(*)
> version of repr(). This does the right thing for numbers as well as
> for many other types (e.g. None, bool) and does something unpleasant
> for text strings that is perhaps better than the alternative.
>
> Which reminds me. Quite a few people have spoken out in favor of loud
> failures rather than silent "wrong" output. But I think that in the
> specific context of formatting output, there is a long and IMO good
> tradition of producing (slightly) wrong output in preference to stricter
> behavior. Consider for example what to do when a number doesn't fit in
> the given width. Would you rather raise an exception, truncate the
> value, or mess up the formatting? All languages newer than Fortran
> that I've used have chosen the latter, and I still agree it's a good
> idea. Similarly with infinities, NaN, or None. (Yes, it's embarrassing
> to have a website displaying 'null'. But isn't a 500 even *more*
> embarrassing?)
>
> This doesn't mean I'm insensitive to the argument in favor of loud and
> early failure. It's just that I can see both sides of the coin, and
> I'm still deciding which argument is more important.
>
> (*) Gamer slang for a weapon made less dangerous. :-)
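
For reference, the trade-off quoted above can be sketched in a few lines of
Python 3. The two helper functions are hypothetical names made up for this
illustration; they are not part of PEP 460 or of any proposed API. They only
contrast str() plus a strict ASCII encode (fails depending on the value)
with ascii() (never fails, but escapes text), and show the "too wide" number
case with ordinary str formatting.

```python
def format_via_str(value):
    # Hypothetical helper: str(), then a strict ASCII encode.
    # Whether this raises depends on the *value*, not the type.
    return str(value).encode("ascii", "strict")


def format_via_ascii(value):
    # Hypothetical helper: ascii() is repr() with non-ASCII escaped,
    # so the encode step cannot fail.
    return ascii(value).encode("ascii")


ok = format_via_str("hello")            # plain ASCII text works

try:
    format_via_str("h\xe9llo")          # same type, different value
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

nerfed = format_via_ascii("h\xe9llo")   # quoted and escaped, no exception
number = format_via_ascii(42)           # numbers come out as plain digits
nested = format_via_ascii(["h\xe9llo"]) # containers are escaped as well

# A number that does not fit the requested width is neither truncated
# nor rejected; the field simply grows.
too_wide = "%3d" % 12345
```

The ascii() path yields b"'h\\xe9llo'" for the text string: quotes and a
backslash escape, which is the "something unpleasant" mentioned above.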

I think loud and early failure is important for porting while you
might still be trying to pound out the previously blurry encode/decode
boundaries. In this code str and bytes will be wrong everywhere. Some
APIs might return either str or bytes based on the input. Let it fail,
find the boundaries, and fix it until it does something useful without
failing. And it kind of depends on the context whether it is worse to
display weird ephemeral output or write the same weird output to long
term storage.
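
A minimal sketch of the kind of failure that helps during porting, using
only current Python 3 behavior. The join_header() helper is a hypothetical
example, not code from any real project. Mixing str into a bytes expression
raises TypeError for every value, including pure ASCII, so the missing
encode step shows up the first time the line runs rather than only in
deployment.

```python
def join_header(name, value):
    # Hypothetical wire-format helper: expects bytes for both arguments.
    return name + b": " + value + b"\r\n"


good = join_header(b"Host", b"example.org")

try:
    # A str slipped through an un-ported boundary. The value is pure
    # ASCII, yet the failure is immediate because it depends on the type.
    join_header(b"Host", "example.org")
    raised = False
except TypeError:
    raised = True
```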

I'm not sure what to think about content-dependent failures on
protocols that are supposed to be ASCII-only-without-repr-noise.