[Python-3000] PEP 3101 update

Sun Jun 11 03:07:05 CEST 2006

Here's the latest PEP 3101 - I've incorporated changes based on 
suggestions from a lot of folks. This version incorporates:

   -- a detailed specification for conversion type fields
   -- description of error handling behavior
   -- 'strict' vs. 'lenient' error handling flag
   -- compound field names
   -- braces are now escaped using {{ instead of \{

--------------------------------------------------------------------
PEP: 3101
Title: Advanced String Formatting
Version: $Revision: 46845 $
Last-Modified: $Date: 2006-06-10 17:59:06 -0700 (Sat, 10 Jun 2006) $
Author: Talin <talin at acm.org>
Status: Draft
Type: Standards
Content-Type: text/plain
Created: 16-Apr-2006
Python-Version: 3.0
Post-History: 28-Apr-2006, 6-May-2006, 10-Jun-2006

Abstract

     This PEP proposes a new system for built-in string formatting
     operations, intended as a replacement for the existing '%' string
     formatting operator.

Rationale

     Python currently provides two methods of string interpolation:

     - The '%' operator for strings. [1]

     - The string.Template module. [2]

     The scope of this PEP will be restricted to proposals for built-in
     string formatting operations (in other words, methods of the
     built-in string type).

     The '%' operator is primarily limited by the fact that it is a
     binary operator, and therefore can take at most two arguments.
     One of those arguments is already dedicated to the format string,
     leaving all other variables to be squeezed into the remaining
     argument.  The current practice is to use either a dictionary or a
     tuple as the second argument, but as many people have commented
     [3], this lacks flexibility.  The "all or nothing" approach
     (meaning that one must choose between only positional arguments,
     or only named arguments) is felt to be overly constraining.

     While there is some overlap between this proposal and
     string.Template, it is felt that each serves a distinct need,
     and that one does not obviate the other.  In any case,
     string.Template will not be discussed here.

Specification

     The specification will consist of the following parts:

     - Specification of a new formatting method to be added to the
       built-in string class.

     - Specification of a new syntax for format strings.

     - Specification of a new set of class methods to control the
       formatting and conversion of objects.

     - Specification of an API for user-defined formatting classes.

     - Specification of how formatting errors are handled.

     Note on string encodings: Since this PEP is being targeted
     at Python 3.0, it is assumed that all strings are unicode strings,
     and that the use of the word 'string' in the context of this
     document will generally refer to a Python 3.0 string, which is
     the same as the Python 2.x unicode object.

     If it should happen that this functionality is backported to
     the 2.x series, then it will be necessary to handle both regular
     strings and unicode objects.  All of the function call
     interfaces described in this PEP can be used for both strings
     and unicode objects, and in all cases there is sufficient
     information to be able to properly deduce the output string
     type (in other words, there is no need for two separate APIs).
     In all cases, the type of the template string dominates - that
     is, the conversion will always produce an object
     that contains the same representation of characters as the
     input template string.

String Methods

     The built-in string class will gain a new method, 'format',
     which takes an arbitrary number of positional and keyword
     arguments:

         "The story of {0}, {1}, and {c}".format(a, b, c=d)

     Within a format string, each positional argument is identified
     with a number, starting from zero, so in the above example, 'a' is
     argument 0 and 'b' is argument 1.  Each keyword argument is
     identified by its keyword name, so in the above example, 'c' is
     used to refer to the third argument.
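
     As a small illustration (a sketch that assumes an interpreter
     in which the 'format' method is available with the behaviour
     described above; the argument values are invented), positional
     and keyword arguments can be mixed in a single call:

```python
# Positional arguments are numbered from zero; 'c' is a keyword name.
story = "The story of {0}, {1}, and {c}".format("Alice", "Bob", c="Carol")
print(story)
```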

Format Strings

     Brace characters ('curly braces') are used to indicate a
     replacement field within the string:

         "My name is {0}".format('Fred')

     The result of this is the string:

         "My name is Fred"

     Braces can be escaped by doubling:

         "My name is {0} :-{{}}".format('Fred')

     Which would produce:

         "My name is Fred :-{}"

     The element within the braces is called a 'field'.  Fields consist
     of a 'field name', which can either be simple or compound, and an
     optional 'conversion specifier'.
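
     The brace-escaping rule described above can be checked with a
     short sketch (again assuming a 'format' method that behaves as
     described): doubled braces yield literal braces, while single
     braces still delimit a field.

```python
# '{{' and '}}' outside a field become literal '{' and '}'.
text = "My name is {0} :-{{}}".format("Fred")
print(text)
```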

Simple and Compound Field Names

     Simple field names are either names or numbers. If numbers, they
     must be valid base-10 integers; if names, they must be valid
     Python identifiers.  A number is used to identify a positional
     argument, while a name is used to identify a keyword argument.

     A compound field name is a combination of multiple simple field
     names in an expression:

         "My name is {0.name}".format(file('out.txt'))

     This example shows the use of the 'getattr' or 'dot' operator
     in a field expression. The dot operator allows an attribute of
     an input value to be specified as the field value.

     The types of expressions that can be used in a compound name
     have been deliberately limited in order to prevent potential
     security exploits resulting from the ability to place arbitrary
     Python expressions inside of strings. Only two operators are
     supported, the '.' (getattr) operator, and the '[]' (getitem)
     operator.

     An example of the 'getitem' syntax:

         "My name is {0[name]}".format(dict(name='Fred'))

     It should be noted that the use of 'getitem' within a string is
     much more limited than its normal use. In the above example, the
     string 'name' really is the literal string 'name', not a variable
     named 'name'. The rules for parsing an item key are the same as
     for parsing a simple name - in other words, if it looks like a
     number, then it is treated as a number; if it looks like an
     identifier, then it is used as a string.

     It is not possible to specify arbitrary dictionary keys from
     within a format string.
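
     The lookup rules for compound names can be sketched in pure
     Python.  This is only an illustration, not the proposed
     implementation; the helper name 'resolve_field' and the regular
     expressions are assumptions made for this example.

```python
import re
from types import SimpleNamespace

# A simple name followed by any number of '.attr' or '[key]' steps.
_FIELD_RE = re.compile(r"(\w+)((?:\.\w+|\[\w+\])*)$", re.ASCII)
_STEP_RE = re.compile(r"\.(\w+)|\[(\w+)\]", re.ASCII)


def _simple(token):
    # If it looks like a number it is a number, otherwise a string.
    return int(token) if token.isdigit() else token


def resolve_field(name, args, kwargs):
    """Return the value a (hypothetical) field name refers to."""
    match = _FIELD_RE.match(name)
    if match is None:
        raise ValueError("invalid field name: %r" % name)
    first = _simple(match.group(1))
    # Numbers select positional arguments, names select keywords.
    value = args[first] if isinstance(first, int) else kwargs[first]
    for attr, key in _STEP_RE.findall(match.group(2)):
        if attr:
            value = getattr(value, attr)      # the '.' operator
        else:
            value = value[_simple(key)]       # the '[]' operator
    return value


by_attr = resolve_field("0.name", (SimpleNamespace(name="Fred"),), {})
by_item = resolve_field("0[name]", (dict(name="Fred"),), {})
by_index = resolve_field("rows[1]", (), {"rows": ["a", "b"]})
```

     Because only these two operators are recognised, a field name
     can never evaluate an arbitrary Python expression.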

Conversion Specifiers

     Each field can also specify an optional set of 'conversion
     specifiers' which can be used to adjust the format of that field.
     Conversion specifiers follow the field name, with a colon (':')
     character separating the two:

         "My name is {0:8}".format('Fred')

     The meaning and syntax of the conversion specifiers depend on the
     type of object that is being formatted; however, many of the
     built-in types will recognize a standard set of conversion
     specifiers.

     Conversion specifiers can themselves contain replacement fields.
     For example, a field whose field width is itself a parameter
     could be specified via:

         "{0:{1}}".format(a, b)

     Note that the doubled '}' at the end, which would normally be
     escaped, is not escaped in this case.  This is because
     the '{{' and '}}' syntax for escapes is only applied when used
     *outside* of a format field. Within a format field, the brace
     characters always have their normal meaning.

     The syntax for conversion specifiers is open-ended, since, other
     than doing field replacements, the format() method does not
     attempt to interpret them in any way; it merely passes all of the
     characters between the first colon and the matching brace to
     the various underlying formatter methods.
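
     A brief sketch of a nested replacement field, assuming a
     'format' method that behaves as described and a string value
     that is left-aligned by default (the values are invented):

```python
# Argument 1 supplies the width used to format argument 0.
padded = "{0:{1}}".format("Fred", 8)
print(repr(padded))
```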

Standard Conversion Specifiers

     If an object does not define its own conversion specifiers, a
     standard set of conversion specifiers is used.  These are similar
     in concept to the conversion specifiers used by the existing '%'
     operator; however, there are also a number of significant
     differences.  The standard conversion specifiers fall into three
     major categories: string conversions, integer conversions and
     floating point conversions.

     The general form of a standard conversion specifier is:

         [[fill]align][sign][width][.precision][type]

     The brackets ([]) indicate an optional field.

     The optional align flag can be one of the following:

         '<' - Forces the field to be left-aligned within the available
               space (This is the default.)
         '>' - Forces the field to be right-aligned within the
               available space.
         '=' - Forces the padding to be placed immediately after
               the sign, if any. This is used for printing fields
               in the form '+000000120'.

     Note that unless a minimum field width is defined, the field
     width will always be the same size as the data to fill it, so
     that the alignment option has no meaning in this case.

     The optional 'fill' character defines the character to be used to
     pad the field to the minimum width.  The alignment flag must be
     supplied if the character is a number other than 0 (otherwise the
     character would be interpreted as part of the field width
     specifier). A zero fill character without an alignment flag
     implies an alignment type of '='.

     The 'sign' field can be one of the following:

         '+'  - indicates that a sign should be used for both
                positive as well as negative numbers
         '-'  - indicates that a sign should be used only for negative
                numbers (this is the default behaviour)
         ' '  - indicates that a leading space should be used on
                positive numbers
         '()' - indicates that negative numbers should be surrounded
                by parentheses

     'width' is a decimal integer defining the minimum field width. If
     not specified, then the field width will be determined by the
     content.

     The 'precision' field is a decimal number indicating how many
     digits should be displayed after the decimal point.

     Finally, the 'type' determines how the data should be presented.
     If the type field is absent, an appropriate type will be assigned
     based on the value to be formatted ('d' for integers and longs,
     'g' for floats, and 's' for everything else.)
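
     To make the general form concrete, here is a rough sketch of a
     parser for standard conversion specifiers.  It is illustrative
     only: the name 'parse_spec', the regular expression, and the
     default fill character (a space) are assumptions, and the type
     default is left unset because it depends on the value.

```python
import re

# One optional group per element of the general form:
#     [[fill]align][sign][width][.precision][type]
_SPEC_RE = re.compile(
    r"(?:(?P<fill>.)?(?P<align>[<>=]))?"
    r"(?P<sign>\(\)|[-+ ])?"
    r"(?P<width>\d+)?"
    r"(?:\.(?P<precision>\d+))?"
    r"(?P<type>[A-Za-z%])?"
)


def parse_spec(spec):
    """Split a standard conversion specifier into its parts."""
    match = _SPEC_RE.fullmatch(spec)
    if match is None:
        raise ValueError("invalid conversion specifier: %r" % spec)
    parts = match.groupdict()
    width = parts["width"]
    # A zero fill character without an alignment flag implies '='.
    if parts["align"] is None and width and len(width) > 1 and width[0] == "0":
        parts["fill"], parts["align"] = "0", "="
    # Defaults: left alignment, sign only on negatives.
    if parts["align"] is None:
        parts["align"] = "<"
    if parts["fill"] is None:
        parts["fill"] = " "   # assumed default, not stated in the text
    if parts["sign"] is None:
        parts["sign"] = "-"
    for key in ("width", "precision"):
        if parts[key] is not None:
            parts[key] = int(parts[key])
    return parts


star = parse_spec("*>8")
zero = parse_spec("+08d")
paren = parse_spec("()10.2f")
```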

     The available string conversion types are:

         's' - String format. Invokes str() on the object.
               This is the default conversion specifier type.
         'r' - Repr format. Invokes repr() on the object.

     There are several integer conversion types. All invoke int() on
     the object before attempting to format it.

     The available integer conversion types are:

         'b' - Binary. Outputs the number in base 2.
         'c' - Character. Converts the integer to the corresponding
               unicode character before printing.
         'd' - Decimal Integer. Outputs the number in base 10.
         'o' - Octal format. Outputs the number in base 8.
         'x' - Hex format. Outputs the number in base 16, using lower-
               case letters for the digits above 9.
         'X' - Hex format. Outputs the number in base 16, using upper-
               case letters for the digits above 9.

     There are several floating point conversion types. All invoke
     float() on the object before attempting to format it.

     The available floating point conversion types are:

         'e' - Exponent notation. Prints the number in scientific
               notation using the letter 'e' to indicate the exponent.
         'E' - Exponent notation. Same as 'e' except it uses an upper
               case 'E' as the separator character.
         'f' - Fixed point. Displays the number as a fixed-point
               number.
         'F' - Fixed point. Same as 'f'.
         'g' - General format. This prints the number as a fixed-point
               number, unless the number is too large, in which case
               it switches to 'e' exponent notation.
         'G' - General format. Same as 'g' except switches to 'E'
               if the number gets too large.
         'n' - Number. This is the same as 'g', except that it uses the
               current locale setting to insert the appropriate
               number separator characters.
         '%' - Percentage. Multiplies the number by 100 and displays
               in fixed ('f') format, followed by a percent sign.

     Objects are able to define their own conversion specifiers to
     replace the standard ones.  An example is the 'datetime' class,
     whose conversion specifiers might look something like the
     arguments to the strftime() function:

         "Today is: {0:a b d H:M:S Y}".format(datetime.now())

Controlling Formatting

     A class that wishes to implement a custom interpretation of its
     conversion specifiers can implement a __format__ method:

         class AST:
             def __format__(self, specifiers):
                 ...

     The 'specifiers' argument will be either a string object or a
     unicode object, depending on the type of the original format
     string.  The __format__ method should test the type of the
     specifiers parameter to determine whether to return a string or
     unicode object.  It is the responsibility of the __format__ method
     to return an object of the proper type.

     string.format() will format each field using the following steps:

      1) See if the value to be formatted has a __format__ method.  If
         it does, then call it.

      2) Otherwise, check the internal formatter within string.format
         that contains knowledge of certain builtin types.

      3) Otherwise, call str() or unicode() as appropriate.
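
     A minimal sketch of a class that interprets its own conversion
     specifiers.  The 'Money' class and its 'p' specifier are
     invented for illustration, and the sketch assumes an
     interpreter in which format() consults __format__ first, as in
     step 1 above.

```python
class Money:
    """Hypothetical value type used only for illustration."""

    def __init__(self, amount):
        self.amount = amount

    def __format__(self, specifiers):
        # 'p' is an invented specifier: parenthesize negative amounts.
        if specifiers == "p":
            if self.amount < 0:
                return "(%.2f)" % -self.amount
            return "%.2f" % self.amount
        # Anything else is handed to the float formatter unchanged.
        return format(self.amount, specifiers)


custom = "{0:p}".format(Money(-12.5))
standard = "{0:.1f}".format(Money(2.5))
```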

User-Defined Formatting Classes

     There will be times when customizing the formatting of fields
     on a per-type basis is not enough.  An example might be an
     accounting application, which displays negative numbers in
     parentheses rather than using a negative sign.

     The string formatting system facilitates this kind of application-
     specific formatting by allowing user code to directly invoke
     the code that interprets format strings and fields.  User-written
     code can intercept the normal formatting operations on a per-field
     basis, substituting its own formatting methods.

     For example, in the aforementioned accounting application, there
     could be an application-specific number formatter, which reuses
     the string.format templating code to do most of the work. The
     API for such an application-specific formatter is up to the
     application; here are several possible examples:

         cell_format("The total is: {0}", total)

         TemplateString("The total is: {0}").format(total)

     Creating an application-specific formatter is relatively straight-
     forward.  The string and unicode classes will have a class method
     called 'cformat' that does all the actual work of formatting; the
     built-in format() method is just a wrapper that calls cformat.

     The type signature for the cformat function is as follows:

         cformat(template, format_hook, args, kwargs)

     The parameters to the cformat function are:

         -- The format template string.
         -- A callable 'format hook', which is called once per field
         -- A tuple containing the positional arguments
         -- A dict containing the keyword arguments

     The cformat function will parse all of the fields in the format
     string, and return a new string (or unicode) with all of the
     fields replaced with their formatted values.

     The format hook is a callable object supplied by the user, which
     is invoked once per field, and which can override the normal
     formatting for that field.  For each field, the cformat function
     will attempt to call the field format hook with the following
     arguments:

        format_hook(value, conversion)

     The 'value' field corresponds to the value being formatted, which
     was retrieved from the arguments using the field name.

     The 'conversion' argument is the conversion spec part of the
     field, which will be either a string or unicode object, depending
     on the type of the original format string.

     The format hook will be called once per field. The format hook
     may take one of two actions:

         1) Return a string or unicode object that is the result
            of the formatting operation.

         2) Return None, indicating that the format hook will not
            process this field and the default formatting should be
            used.  This decision should be based on the type of the
            value object, and the contents of the conversion string.
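
     The hook protocol can be sketched as a stand-alone function.
     This is a simplified stand-in for the proposed 'cformat', not
     its implementation: it handles only simple field names and
     non-nested specifiers, and it falls back on Python's built-in
     format() for default formatting.  The names 'accounting_hook'
     and 'cell_format' are hypothetical.

```python
import re

# An escaped brace pair, or one simple (non-nested) field.
_TOKEN = re.compile(r"\{\{|\}\}|\{([^{}:]*)(?::([^{}]*))?\}")


def cformat(template, format_hook, args, kwargs):
    """Replace each field, giving format_hook the first chance."""
    def replace(match):
        text = match.group(0)
        if text == "{{":
            return "{"
        if text == "}}":
            return "}"
        name, conversion = match.group(1), match.group(2) or ""
        value = args[int(name)] if name.isdigit() else kwargs[name]
        result = None
        if format_hook is not None:
            result = format_hook(value, conversion)
        if result is None:
            # The hook declined, so use the default formatting.
            result = format(value, conversion)
        return result
    return _TOKEN.sub(replace, template)


def accounting_hook(value, conversion):
    # Show negative numbers in parentheses; decline everything else.
    if isinstance(value, (int, float)) and value < 0:
        return "(" + format(-value, conversion) + ")"
    return None


def cell_format(template, *args, **kwargs):
    return cformat(template, accounting_hook, args, kwargs)


loss = cell_format("The total is: {0}", -120)
gain = cell_format("The total is: {0}", 120)
```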

Error handling

     The string formatting system has two error handling modes, which
     are controlled by the value of a class variable:

        string.strict_format_errors = True

     The 'strict_format_errors' flag defaults to False, or 'lenient'
     mode. Setting it to True enables 'strict' mode. The current mode
     determines how errors are handled, depending on the type of the
     error.

     The types of errors that can occur are:

     1) Reference to a missing or invalid argument from within a
     field specifier. In strict mode, this will raise an exception.
     In lenient mode, this will cause the value of the field to be
     replaced with the string '?name?', where 'name' will be the
     type of error (KeyError, IndexError, or AttributeError).

     So for example:

         >>> string.strict_format_errors = False
         >>> print 'Item 2 of argument 0 is: {0[2]}'.format([0, 1])
         Item 2 of argument 0 is: ?IndexError?

     2) Unused argument. In strict mode, this will raise an exception.
     In lenient mode, this will be ignored.

     3) Exception raised by underlying formatter. These exceptions
     are always passed through, regardless of the current mode.
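
     The first kind of error can be sketched as follows.  The helper
     name 'resolve' and its 'strict' parameter are assumptions made
     for this example; the proposal itself controls the mode through
     the 'strict_format_errors' flag rather than an argument.

```python
def resolve(getter, strict=False):
    """Run a lookup, applying strict or lenient error handling."""
    try:
        return getter()
    except (KeyError, IndexError, AttributeError) as exc:
        if strict:
            raise
        # Lenient mode: substitute the name of the error type.
        return "?%s?" % type(exc).__name__


lenient = "Item 2 of argument 0 is: %s" % resolve(lambda: [0, 1][2])
missing_key = resolve(lambda: {}["name"])
missing_attr = resolve(lambda: object().name)
```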

Alternate Syntax

     Naturally, one of the most contentious issues is the syntax of the
     format strings, and in particular the markup conventions used to
     indicate fields.

     Rather than attempting to exhaustively list all of the various
     proposals, I will cover the ones that are most widely used
     already.

     - Shell variable syntax: $name and $(name) (or in some variants,
       ${name}).  This is probably the oldest convention out there, and
       is used by Perl and many others.  When used without the braces,
       the length of the variable is determined by lexically scanning
       until an invalid character is found.

       This scheme is generally used in cases where interpolation is
       implicit - that is, in environments where any string can contain
       interpolation variables, and no special substitution function
       need be invoked.  In such cases, it is important to prevent the
       interpolation behavior from occurring accidentally, so the '$'
       (which is otherwise a relatively uncommonly-used character) is
       used to signal when the behavior should occur.

       It is the author's opinion, however, that in cases where the
       formatting is explicitly invoked, less care needs to be
       taken to prevent accidental interpolation, in which case a
       lighter and less unwieldy syntax can be used.

     - Printf and its cousins ('%'), including variations that add a
       field index, so that fields can be interpolated out of order.

     - Other bracket-only variations.  Various MUDs (Multi-User
       Dungeons) such as MUSH have used brackets (e.g. [name]) to do
       string interpolation.  The Microsoft .Net libraries use braces
       ({}), and a syntax which is very similar to the one in this
       proposal, although the syntax for conversion specifiers is quite
       different. [4]

     - Backquoting.  This method has the benefit of minimal syntactical
       clutter; however, it lacks many of the benefits of a function
       call syntax (such as complex expression arguments, custom
       formatters, etc.).

     - Other variations include Ruby's #{}, PHP's {$name}, and so
       on.

     Some specific aspects of the syntax warrant additional comments:

     1) Backslash character for escapes.  The original version of
     this PEP used backslash rather than doubling to escape a brace.
     This worked because backslashes in Python string literals that
     don't conform to a standard backslash sequence such as '\n'
     are left unmodified. However, this caused a certain amount
     of confusion, and led to potential situations of multiple
     recursive escapes, e.g. '\\\\{' to place a literal backslash
     in front of a brace.

     2) The use of the colon character (':') as a separator for
     conversion specifiers.  This was chosen simply because that's
     what .Net uses.

Sample Implementation

     A rough prototype of the underlying 'cformat' function has been
     coded in Python; however, it needs much refinement before being
     submitted.

Backwards Compatibility

     Backwards compatibility can be maintained by leaving the existing
     mechanisms in place.  The new system does not collide with any of
     the method names of the existing string formatting techniques, so
     both systems can co-exist until it comes time to deprecate the
     older system.

References

     [1] Python Library Reference - String formatting operations
     http://docs.python.org/lib/typesseq-strings.html

     [2] Python Library Reference - Template strings
     http://docs.python.org/lib/node109.html

     [3] [Python-3000] String formating operations in python 3k
         http://mail.python.org/pipermail/python-3000/2006-April/000285.html

     [4] Composite Formatting - [.Net Framework Developer's Guide]
         http://msdn.microsoft.com/library/en-us/cpguide/html/cpconcompositeformatting.asp?frame=true

Copyright

     This document has been placed in the public domain.

Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: