[Python-Dev] Triple-quoted strings and indentation

Andrew Durdin adurdin at gmail.com
Wed Jul 6 11:45:52 CEST 2005


Here's the draft PEP I wrote up:

Abstract

    Triple-quoted string (TQS henceforth) literals in Python preserve
    the formatting of the literal string including newlines and 
    whitespace.  When a programmer desires no leading whitespace for 
    the lines in a TQS, he must align all lines but the first in the 
    first column, which differs from the syntactic indentation when a 
    TQS occurs within an indented block.  This PEP addresses this 
    issue.


Motivation

    TQS's are generally used in two distinct manners: as multiline 
    text used by the program (typically command-line usage information 
    displayed to the user) and as docstrings.

    Here's a hypothetical but fairly typical example of a TQS as a 
    multiline string:
    
        if not interactive_mode:
            if not parse_command_line():
                print """usage: UTIL [OPTION] [FILE]...

        try `util -h' for more information."""
                sys.exit(1)

    Here the second line of the TQS begins in the first column, which 
    at a glance appears to occur after the close of both "if" blocks.
    This results in a discrepancy between how the code is parsed and 
    how the user initially sees it, forcing the user to clear the 
    mental hurdle of realising that the call to sys.exit() is actually 
    within the second "if" block.
    
    Docstrings on the other hand are usually indented to be more 
    readable, which causes them to have extraneous leading whitespace 
    on most lines.  To counteract the problem, PEP 257 [1] specifies a 
    standard algorithm for trimming this whitespace.
    
    In the end, the programmer is left with a dilemma: either to align 
    the lines of his TQS to the first column, and sacrifice readability;
    or to indent it to be readable, but have to deal with unwanted
    whitespace.
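
    The second half of this dilemma can be illustrated with the
    standard textwrap module.  The sketch below is only an
    illustration of current behaviour; the function name usage_text
    is hypothetical and not taken from any real program.

```python
import textwrap


def usage_text():
    # Indented to match the surrounding code: under current rules every
    # line keeps its eight leading spaces in the resulting string.
    return """\
        usage: UTIL [OPTION] [FILE]...

        try `util -h' for more information."""


raw = usage_text()

# Run-time workaround: remove the common leading whitespace afterwards.
fixed = textwrap.dedent(raw)
```

    Here raw still starts with eight spaces, while fixed starts with
    "usage:".  The cleanup has to be repeated at every such string,
    which is the cost this proposal aims to remove.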

    This PEP proposes that TQS's should have a certain amount of 
    leading whitespace trimmed by the parser, thus avoiding the 
    drawbacks of the current behaviour.
   

Specification

    Leading whitespace in TQS's will be dealt with in a similar manner 
    to that proposed in PEP 257:
    
        "... strip a uniform amount of indentation from the second
        and further lines of the [string], equal to the minimum 
        indentation of all non-blank lines after the first line.  Any 
        indentation in the first line of the [string] (i.e., up to 
        the first newline) is insignificant and removed.  Relative 
        indentation of later lines in the [string] is retained."

    Note that a line within the TQS that is entirely blank or consists 
    only of whitespace will not count toward the minimum indent, and 
    will be retained as a blank line (possibly with some trailing 
    whitespace).
        
    There are several significant differences between this proposal and
    PEP 257's docstring parsing algorithm:
    
    *   This proposal considers all lines to end at the next newline in
        the source code (whether escaped or not); PEP 257's algorithm
        only considers lines to end at the next (necessarily unescaped)
        newline in the parsed string.
        
    *   Only literal whitespace is counted; an escape such as \x20 
        will not be counted as indentation.
        
    *   Tabs are not converted to spaces.

    *   Blank lines at the beginning and end of the TQS will *not* be 
        stripped.

    *   Leading whitespace on the first line is preserved, as is 
        trailing whitespace on all lines.
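
    The rules above might be sketched in Python roughly as follows.
    This is not the implementation mentioned later in this PEP; it
    is a minimal sketch that works on the raw source text between
    the triple quotes.  The names indent_of, trim_tqs_source and
    evaluate are hypothetical.  Two assumptions are made: a tab
    counts as one character of indentation, and the final line
    always counts toward the minimum because in the source it also
    holds the closing triple quote (as Example 3 suggests).

```python
import ast


def indent_of(line):
    # Count literal leading spaces and tabs only; an escape such as \x20
    # begins with a backslash and is therefore not counted.
    return len(line) - len(line.lstrip(" \t"))


def trim_tqs_source(body):
    # `body` is the text exactly as written in the source file, before
    # escape processing, so an escaped newline still ends a line.
    lines = body.split("\n")
    if len(lines) == 1:
        return body  # single-line literal: nothing to trim
    rest = lines[1:]
    # Whitespace-only lines do not count toward the minimum, except the
    # final line, which shares its source line with the closing quote
    # (assumption, see lead-in).
    widths = [indent_of(line) for line in rest[:-1] if line.strip(" \t")]
    widths.append(indent_of(rest[-1]))
    margin = min(widths)
    # The first line is kept as written; whitespace-only lines shorter
    # than the margin simply become empty.
    return "\n".join([lines[0]] + [line[margin:] for line in rest])


def evaluate(body):
    # Stand-in for the parser's later escape processing.
    # Assumes the body contains no ''' sequence.
    return ast.literal_eval("'''" + body + "'''")


# The second form of Example 3, relative to the "expect" statement.
source_body = (
    "\\\n"
    "          * This paragraph will be filled, first\n"
    "            without any indentation, and then\n"
    "            with some (including a hanging\n"
    "            indent).\\\n"
    "        "
)
value = evaluate(trim_tqs_source(source_body))
```

    With this input, value keeps the two-space indent before the
    bullet and has no leading or trailing newline, because both
    escaped newlines are removed after trimming.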


Rationale

    I considered several different ways of determining
    the amount of whitespace to be stripped, including:
    
    1.  Determined by the column (after allowing for expanded tabs) of 
        the triple-quote:
            
            myverylongvariablename = """\
                                         This line is indented,
                                     But this line is not.
                                     Note the trailing newline:
                                     """
        
        +   Easily allows all lines to be indented.
        
        -   Easily leads to problems due to re-alignment of all but 
            first line when mixed tabs and spaces are used.
        
        -   Forces programmers to use a particular level of 
            indentation for continuing TQS's.
        
        -   Unclear whether the lines should align with the triple-
            quote or immediately after it.

        -   Not backward compatible with most non-docstrings.

    2.  Determined by the indent level of the second line of the 
        string:
    
            myverylongvariablename = """\
                This line is not indented (and has no leading newline),
                    But this one is.
                Note the trailing newline:
                """
    
        +   Allows for flexible alignment of lines.
        
        +   Mixed tabs and spaces should be fine (as long as they're 
            consistent).
        
        -   Cannot support an indent on the second line of the 
            string (very bad!).
            
        -   Not backward compatible with most non-docstrings.
    
    3.  Determined by the minimum indent level of all lines after the 
        first:
        
            myverylongvariablename = """\
                    This line is indented,
                But this line is not.
                Note the trailing newline:
                """
    
        +   Allows for flexible alignment of lines.
        
        +   Mixed tabs and spaces should be fine (as long as they're 
            consistent).

        +   Backward compatible with all docstrings and a majority of 
            non-docstrings.

        -   Support for indentation on all lines is not immediately 
            obvious.

    Overall, solution 3 provided the best balance of features, and 
    (importantly) had the best backward compatibility.  I thus
    consider it the most suitable.
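
    The difference between solutions 2 and 3 can be made concrete
    with a small sketch.  All function names here are hypothetical,
    and the handling of a line indented less than the margin (only
    its own whitespace is removed) is an assumption, since solution
    2 does not define that case.

```python
def indent_of(line):
    return len(line) - len(line.lstrip(" \t"))


def strip_margin(lines, margin):
    # Assumption: never remove non-whitespace characters, so a line with
    # less indentation than the margin loses only its own whitespace.
    first, rest = lines[0], lines[1:]
    return [first] + [line[min(margin, indent_of(line)):] for line in rest]


def second_line_margin(lines):
    # Solution 2: the second line alone fixes the margin.
    return indent_of(lines[1])


def minimum_margin(lines):
    # Solution 3: the least-indented non-blank later line fixes the margin.
    return min(indent_of(line) for line in lines[1:] if line.strip())


# Source lines of a literal whose second line is meant to stay indented.
lines = [
    "\\",
    "        This line is indented,",
    "    But this line is not.",
    "    Note the trailing newline:",
]

by_second = strip_margin(lines, second_line_margin(lines))
by_minimum = strip_margin(lines, minimum_margin(lines))
```

    Under solution 2 the intended indent of the second line is lost
    (by_second[1] starts at "This"), whereas solution 3 retains four
    spaces of relative indentation in by_minimum[1].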


Examples

    The examples here are set out in pairs: the first of each pair 
    shows how the TQS must be currently written to avoid indentation 
    issues; the second shows how it can be written using this proposal 
    (although some variation is possible).  All examples are taken or 
    adapted from the Python standard library or another real source.
    
    1.  Command-line usage information:

        def usage(outfile):
            outfile.write("""Usage: %s [OPTIONS] <file> [ARGS]

        Meta-options:
        --help                Display this help then exit.
        --version             Output version information then exit.
        """ % sys.argv[0])

        #------------------------#
        
        def usage(outfile):
            outfile.write("""Usage: %s [OPTIONS] <file> [ARGS]

                Meta-options:
                --help                Display this help then exit.
                --version             Output version information then exit.
                """ % sys.argv[0])

    2.  Embedded Python code:

        self.runcommand("""if 1:
        import sys as _sys
        _sys.path = %r
        del _sys
        \n""" % (sys.path,))

        #------------------------#

        self.runcommand("""\
            if 1:
                import sys as _sys
                _sys.path = %r
                del _sys
                \n""" % (sys.path,))

    3.  Unit testing:
    
        class WrapTestCase(BaseTestCase):
            def test_subsequent_indent(self):
                # Test subsequent_indent parameter
                expect = '''\
          * This paragraph will be filled, first
            without any indentation, and then
            with some (including a hanging
            indent).'''
                result = fill(self.text, 40,
                              initial_indent="  * ",
                              subsequent_indent="    ")
                self.check(result, expect)

        #------------------------#
     
        class WrapTestCase(BaseTestCase):
            def test_subsequent_indent(self):
                # Test subsequent_indent parameter
                expect = '''\
                      * This paragraph will be filled, first
                        without any indentation, and then
                        with some (including a hanging
                        indent).\
                    '''
                result = fill(self.text, 40,
                              initial_indent="  * ",
                              subsequent_indent="    ")
                self.check(result, expect)

    Example 3 illustrates how indentation of all lines (by 2 spaces) 
    is achieved with this proposal: the position of the closing 
    triple quote is used to determine the minimum indentation for the 
    whole string.  To avoid a trailing newline in the string, the 
    final newline is escaped.  Example 2 avoids the need for this 
    construction by placing the first line (which is not indented) on 
    the line after the triple-quote, and escaping the leading 
    newline.


Backwards Compatibility

    Uses of TQS's fall into two broad categories: those where 
    indentation is significant, and those where it is not.  Those in 
    the latter (larger) category, which includes all docstrings, will 
    remain effectively unchanged under this proposal.  Docstrings in 
    particular are usually trimmed according to the rules in PEP 257 
    before their value is used; the trimmed strings will be the same 
    under this proposal as they are now.
    
    Of the former category, the majority are those which have at least 
    one line beginning in the first column of the source code; these 
    will be entirely unaffected if left alone, but may be reformatted 
    to increase readability (see example 1 above).  However, a small 
    number of strings in this first category depend on all lines (or 
    all but the first) being indented.  Under this proposal, these 
    will need to be edited to ensure that the intended amount of 
    whitespace is preserved.  Examples 2 and 3 above show two 
    different ways to reformat the strings for these cases.  Note that 
    in both examples, the overall indentation of the code is cleaner, 
    producing more readable code.
    
    Some evidence may be desired to support the claims made above 
    regarding the distribution of the different uses of TQS's.  I have 
    begun some analysis to produce some statistics for these; while 
    still incomplete, I have some initial results for the Python 2.4.1 
    standard library (these figures should not be off by more than a 
    small margin):
    
    In the standard library (some 396,598 lines of Python code), there 
    are 7,318 occurrences of TQS's, an average rate of one per 54 
    lines.  Of these, 6,638 (90.7%) are docstrings; the remaining 680 
    (9.3%) are not.  A further examination shows that only 64 of 
    these (0.9% of all TQS's) have leading indentation on all lines 
    (the only case where the proposed solution is not backward 
    compatible).  These must be manually checked to determine whether 
    they will be affected; such a check reveals only 7-15 TQS's 
    (0.1%-0.2% of the total) that actually need to be edited.
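
    The script used for this analysis is not shown here.  One way
    such counts might be gathered is sketched below with the
    standard tokenize module.  The function name tqs_stats is
    hypothetical, docstrings are not distinguished from other
    strings, and f-strings may be skipped on newer Python versions
    that tokenize them differently.

```python
import io
import tokenize


def tqs_stats(source):
    """Return (number of TQS's, number with every later line indented)."""
    total = 0
    all_indented = 0
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type != tokenize.STRING:
            continue
        # Drop any string prefix such as r, u or b before the quotes.
        text = tok.string.lstrip("rRuUbBfF")
        if not (text.startswith('"""') or text.startswith("'''")):
            continue
        total += 1
        # Lines after the first, as written in the source; the last one
        # includes the closing triple quote.  Blank lines are ignored.
        later = [line for line in text.split("\n")[1:] if line.strip()]
        if later and all(line[0] in (" ", "\t") for line in later):
            all_indented += 1
    return total, all_indented


sample = (
    'a = """one\n'
    'two\n'
    '"""\n'
    '\n'
    'def f():\n'
    '    b = """\\\n'
    '        x\n'
    '        y\n'
    '        """\n'
    '    return b\n'
    'c = "plain"\n'
)
total, all_indented = tqs_stats(sample)
```

    For the sample above, two TQS's are found and only the one
    inside f() has leading indentation on every later line; strings
    of that kind are the ones that would need a manual check.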

    Although small, the impact of this proposal on compatibility is 
    still more than negligible; if accepted in principle, it might be 
    better suited to be initially implemented as a __future__ feature, 
    or perhaps relegated to Python 3000.
    

Implementation

    An implementation of this proposal has been made; however, I have 
    not yet made a patch file with the changes, nor do the changes yet 
    extend to the documentation or other affected areas.


References

    [1] PEP 257, Docstring Conventions, David Goodger, Guido van Rossum
        http://www.python.org/peps/pep-0257.html


Copyright

    This document has been placed in the public domain.

