[Python-3000] More PEP 3101 changes incoming

Fri Aug 3 08:55:03 CEST 2007

Guido van Rossum wrote:
> My personal suggestion is to stay close to the .NET formatting language:
> 
>   name_specifier [',' width_specifier] [':' conversion_specifier]
> 
> where width_specifier is a positive or negative number giving the
> minimum width (negative for left-alignment) and conversion_specifier
> is passed uninterpreted to the object's __format__ method.

Before I comment on this I think I need to clear up a mismatch between
your understanding of how __format__ works and mine. In particular, I
want to explain why it won't work for float and int to define a
__format__ method.

Remember how I said in your office that it made sense to me there were 
two levels of format hooks in .Net? I realize that I wasn't being very 
clear at the time - as often happens when my thoughts are racing too 
fast for my mouth.

What I meant was that conceptually, there are two stages of 
customization, which I will call "pre-coercion" and "post-coercion" 
customization.

Before I explain what that means, let me say that I don't think that 
this is actually how .Net works, and I'm not proposing that there 
actually be two customization hooks. What I want to do is describe an 
abstract conceptual model of formatting, in which formatting occurs in a 
number of stages.

Pre-coercion formatting means that the real type of the value is used to 
control formatting. We don't attempt to convert the value to an int or 
float or repr() or anything - instead it's allowed to completely 
dominate the interpretation of the format codes. So the case of the
DateTime object interpreting its specifiers as a strftime argument falls
into this category.

In most cases, there won't be a pre-coercion hook, in which case the
formatting proceeds to the next two stages, which are type coercion and
then post-coercion formatting. The type coercion is driven by a
*standard interpretation* of the format specifier. After the value is
converted to the type, we then apply formatting that is specific to that 
type.

Now, I always envisioned that __format__ would allow reinterpretation of 
the format specifier. Therefore, __format__ fits into this model as a 
pre-coercion customization hook - it has to come *before* the type 
coercion, because otherwise type information would be destroyed and 
__format__ wouldn't work.

But the formatters for int and float have to happen *after* type 
coercion. Therefore, those formatters can't be the same as __format__.
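To make the staged model concrete, here is a rough Python sketch. It is
only an illustration of the conceptual model, not a proposed
implementation: the hook name pre_format, the Stamp class, the two
lookup tables and the "digits followed by one type letter" specifier
layout are all made up for this example.

```python
import datetime

# Stage 2 table (hypothetical): type letter -> coercion function.
COERCIONS = {"d": int, "f": float, "s": str, "r": repr}


def format_int(n, detail):
    return "%d" % n


def format_float(x, detail):
    # 'detail' is read as a decimal precision; 6 if none was given.
    return "%.*f" % (int(detail or 6), x)


def format_str(s, detail):
    # 'detail' is read as a maximum length.
    return s[:int(detail)] if detail else s


# Stage 3 table (hypothetical): coerced type -> post-coercion formatter.
POST = {int: format_int, float: format_float, str: format_str}


class Stamp:
    """Hypothetical type with a pre-coercion hook (strftime-style specifier)."""

    def __init__(self, when):
        self.when = when

    def pre_format(self, spec):
        return self.when.strftime(spec)


def staged_format(value, spec):
    # Stage 1: pre-coercion. The real type sees the raw specifier.
    hook = getattr(value, "pre_format", None)
    if hook is not None:
        return hook(spec)
    # Stage 2: coercion, driven by a standard reading of the specifier
    # (assumed here to be optional digits followed by one type letter).
    detail, code = spec[:-1], spec[-1:]
    coerced = COERCIONS.get(code, str)(value)
    # Stage 3: post-coercion formatting, chosen by the coerced type.
    return POST[type(coerced)](coerced, detail)


print(staged_format(Stamp(datetime.date(2007, 8, 3)), "%Y-%m-%d"))
print(staged_format(12, "3f"))
print(staged_format("abcdef", "3s"))
```

Note that the hook in stage 1 never sees a coerced value, while the
int and float formatters in stage 3 only ever see one. That is the
sense in which the two cannot be the same mechanism.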

> In order to support the use cases for %s and %r, I propose to allow
> appending a single letter 's', 'r' or 'f' to the width_specifier
> (*not* the conversion_specifier):
> 
>  'r' always calls repr() on the object;
>  's' always calls str() on the object;
>  'f' calls the object's __format__() method passing it the
> conversion_specifier, or if it has no __format__() method, calls
> repr() on it. This is also the default.
> 
> If no __format__() method was called (either because 'r' or 's' was
> used, or because there was no __format__() method on the object), the
> conversion_specifier (if given) is a *maximum* length; this handles
> the pretty common use cases of %.20s and %.20r (limiting the size of a
> printed value).
> 
> The numeric types are the main types that must provide __format__().
> (I also propose that for datetime types the format string ought to be
> interpreted as a strftime format string.) I think that
> float.__format__() should *not* support the integer formatting codes
> (d, x, o etc.) -- I find the current '%d' % 3.14 == '3' an abomination
> which is most likely an incidental effect of calling int() on the
> argument (should really be __index__()). But int.__format__() should
> support the float formatting codes; I think '%6.3f' % 12 should return
> ' 12.000'. This is in line with 1/2 returning 0.5; int values should
> produce results identical to the corresponding float values when used
> in the same context. I think this should be solved inside
> int.__format__() though; the generic formatting code should not have
> to know about this.

I don't agree that using the 'd' format type to print floats is an 
abomination, but that's because of a difference in design philosophy. 
I'm inclined to be permissive in this, because I don't see the benefit 
of being pedantic here, and I do see the potential usefulness of 
considering 'd' to be the same as 'f' with a precision of 0.
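As a small illustration of that permissive reading, the sketch below
treats 'd' applied to a float as 'f' with a precision of 0. The
function name permissive_d is made up for the example. One thing to
keep in mind is that this rounds, whereas the existing '%d' behaviour
truncates, so the two disagree on a value like 3.7.

```python
def permissive_d(value):
    # Hypothetical reading: 'd' on a float means 'f' with zero decimals.
    return "%.0f" % value


print(permissive_d(3.14))  # 3
print(permissive_d(3.7))   # 4 (rounded)
print("%d" % 3.7)          # 3 (existing behaviour truncates)
```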

But that's a detail. I want to think about the larger picture.

Earlier I said that there were 6 attributes being controlled by the 
various specifiers, but based on the previous discussion there are 
actually 8, in no particular order:

    -- minimum width
    -- maximum width
    -- decimal precision
    -- alignment
    -- padding
    -- treatment of signs and negative numbers
    -- type coercion options
    -- number formatting options for a given type, such as exponential
       notation.

That seems a lot of parameters to cram into a lowly format string, and I 
can't imagine that anyone would like a system that requires these all to 
be specified individually. It would be cumbersome and hard to remember.

Fortunately, we recognize that these parameters are not all independent. 
Many combinations of parameters are nonsensical, especially when talking 
about non-number types. Therefore, we can compress the visual
specification of these attributes into a much smaller number of actual
format codes.

Traditionally the C sprintf function has done two kinds of 
'multiplexing' of these codes. The first is to change the interpretation 
of a particular field (such as precision) based on the number formatting 
type. The second is to use letters to represent combinations of 
attributes - so for example the letter 'd' implies both that it's an 
integer type, and also how that integer type should be formatted.
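Both kinds of multiplexing are visible in Python's existing
%-formatting, which follows the sprintf conventions. In the short
example below (the dictionary and its key names are just for
illustration) the same ".3" field is read three different ways
depending on the type letter, and the single letter 'e' selects both
float coercion and exponential notation.

```python
examples = {
    # Same '.3' field, three interpretations:
    "f": "%.3f" % 3.14159,    # digits after the decimal point
    "s": "%.3s" % "abcdef",   # maximum number of characters
    "d": "%.3d" % 7,          # minimum number of digits
    # One letter implying both the type and the notation:
    "e": "%e" % 1234.5,
}

for code, text in examples.items():
    print(code, repr(text))
# f '3.142'
# s 'abc'
# d '007'
# e '1.234500e+03'
```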

So the challenge is to try and figure out how to represent all of the 
sensible permutations of formatting attributes in a way which is both 
intuitive and mnemonic.

There are two approaches to making this system programmer friendly: We 
can either try to invent the best possible system out of whole cloth, or 
we can steal from the past in the hopes that programmers who already 
know a previous syntax for format strings will be able to employ their 
prior knowledge.

If we decide to create a new system out of whole cloth, then what do we 
have to work with? Well, as I see it we have the following tools at our 
disposal for encoding meaning in a short form:

    -- Various delimiter characters: :,.!#$ and so on.
    -- Letters to represent one or more attributes.
    -- Numbers to represent scalar quantities.
    -- The relative ordering of all of the above.

We also have to consider what it means to be 'intuitive'. In this case, 
we should consider that the various delimiter characters have 
connotations - such as the fact that '.' suggests a decimal point, or 
that '<' suggests a left-pointing arrow.

(I should also mention that "a:b,c" looks prettier to my eye than 
"a,b:c". There's a reason for this, and it's because of Python syntax.
Now, in Python, ':' isn't an operator - but if it were, you would have to
consider its precedence to be very low. Because when we look at an 
expression 'if x: a,b' we know that comma binds more tightly than the 
colon, and so it's the same thing as saying 'if x: (a,b)'. But in any 
case this is purely an aesthetic digression and not terribly weighty.)

That's all I have to say for the moment - I'm still thinking this 
through. In any case, I think it's worthwhile to be scrutinizing this 
issue at a very low level and examining all of the assumptions.

-- Talin