[Python-checkins] python/nondist/sandbox/string alt292.py,NONE,1.1

Sat Aug 28 21:12:48 CEST 2004

Update of /cvsroot/python/python/nondist/sandbox/string
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv28248

Added Files:
	alt292.py 
Log Message:
Proposed improvements to the PEP292 implementation.

* Returns a str if all inputs are str.

* Error messages now include line number and token.

* Allows mapping in the form of keyword arguments.

* Use of locale specific alphabets for identifiers is now trapped
  and reported.  Formerly it would match the ASCII substring and either
  raise a KeyError, make the wrong substitution, or be skipped.

* Improve presentation and commenting of the regular expression.

* Simplified the implementation where possible.

* Implement as a function rather than a class:
  - Eliminates separate (and possibly remote) instantiation and application
  - Allows the function to be self documenting and more referenceable than %
  - Necessary for implementing correct return type and keyword arguments
  - Makes the code clearer and better serves as a model for other code.

* Add doctests to illustrate the improvements.
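
For readers who want to try the behaviour listed above without a Python 2
interpreter, here is a minimal sketch on Python 3. It is not the committed
code. The names dollarsub_sketch and _sketch_pattern are hypothetical, the
identifier classes are spelled out as [_a-zA-Z] instead of using
re.IGNORECASE, and Python 3's default Unicode-aware \w stands in for the
re.LOCALE flag (both are assumptions of the sketch). The str/unicode
return-type rules do not apply on Python 3, so values are simply passed
through str().

```python
import re

# Same four alternatives as the proposed pattern, in the same order.
# Assumption: Unicode-aware \w replaces the re.LOCALE behaviour.
_sketch_pattern = re.compile(r"""
  \$(\$)|                             # Escape sequence of two $ signs
  \$([_a-zA-Z][_a-zA-Z0-9]*(?!\w))|   # $ and an ASCII identifier
  \$\{([_a-zA-Z][_a-zA-Z0-9]*)\}|     # $ and a brace delimited identifier
  \$(\S*)                             # Catchall for ill-formed $ expressions
""", re.VERBOSE)


def dollarsub_sketch(template, mapping=None, **kwds):
    """Hypothetical Python 3 stand-in for the proposed dollarsub()."""
    if mapping is None:
        mapping = kwds          # keyword arguments act as the mapping

    def convert(mo):
        escaped, named, braced, catchall = mo.groups()
        if named or braced:
            return str(mapping[named or braced])
        if escaped:
            return '$'
        # Count newlines before the offending token to get its line number.
        lineno = template.count('\n', 0, mo.start(4)) + 1
        raise ValueError('Invalid placeholder on line %d:  %r'
                         % (lineno, catchall))

    return _sketch_pattern.sub(convert, template)


if __name__ == '__main__':
    print(dollarsub_sketch('the $xxx and', {'xxx': 10}))   # the 10 and
    print(dollarsub_sketch('the ${xxx}s and', xxx='10'))   # the 10s and
    try:
        dollarsub_sketch('Returning $ma\u00f1ana or later.', {})
    except ValueError as exc:
        print(exc)
```

The lookahead is what keeps "$ma\u00f1ana" from being read as "$ma": the
identifier alternative fails as a whole, so the catchall captures the full
token and the error message can name it.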

--- NEW FILE: alt292.py ---
r''' Doctests for PEP 292's string template functions

First, it is now a function and accepts either mappings or keyword arguments:

>>> dollarsub('the $xxx and', {'xxx':10})
'the 10 and'
>>> dollarsub('the $xxx and', xxx='10')
'the 10 and'

Next, it makes sure the return type is a str if all the inputs are str.  Any unicode component will cause a unicode output.  This matches the behavior of other re and string ops:

>>> dollarsub('the $xxx and', xxx='10')
'the 10 and'
>>> dollarsub(u'the $xxx and', xxx='10')
u'the 10 and'
>>> dollarsub('the $xxx and', xxx=u'10')
u'the 10 and'
>>> dollarsub(u'the $xxx and', xxx=u'10')
u'the 10 and'

Non-strings are coerced to the type of the template:

>>> dollarsub('the $xxx and', xxx=10)
'the 10 and'
>>> dollarsub(u'the $xxx and', xxx=10)
u'the 10 and'

The ValueErrors are now more specific.  They include the line number and the mismatched token:

>>> t = """line one
... line two
... the $@malformed token
... line four"""
>>> dollarsub(t, {})
Traceback (most recent call last):
 . . .
ValueError: Invalid placeholder on line 3:  '@malformed'

Also, the re pattern was changed just a bit to catch an important class of locale specific errors where a user may use a non-ASCII identifier.  The previous implementation would match up to the first non-ASCII character and then raise a KeyError if the abbreviated key was (hopefully) not found.  Now, it raises a ValueError highlighting the problem identifier.  Note, we still only accept Python identifiers but have improved error detection:

>>> import locale
>>> savloc = locale.setlocale(locale.LC_ALL)
>>> _ = locale.setlocale(locale.LC_ALL, 'spanish')
>>> t = u'Returning $ma\u00F1ana or later.'
>>> dollarsub(t, {})
Traceback (most recent call last):
 . . .
ValueError: Invalid placeholder on line 1:  u'ma\xf1ana'

>>> _ = locale.setlocale(locale.LC_ALL, savloc)

'''

import re as _re

# Search for $$, $identifier, ${identifier}, and any bare $'s
_pattern = _re.compile(r"""
  \$(\$)|                       # Escape sequence of two $ signs
  \$([_a-z][_a-z0-9]*(?!\w))|   # $ and a Python identifier
  \${([_a-z][_a-z0-9]*)}|       # $ and a brace delimited identifier
  \$(\S*)                       # Catchall for ill-formed $ expressions
""", _re.IGNORECASE | _re.VERBOSE | _re.LOCALE)
# Pattern notes:
#
# The pattern for $identifier includes a negative lookahead assertion
# to make sure that the identifier is not followed by a locale specific
# alphanumeric character other than [_a-z0-9].  The idea is to make sure
# not to partially match ill-formed identifiers containing characters
# from other alphabets.  Without the assertion, the Spanish word for
# tomorrow "ma~nana" (where ~n is 0xF1) would improperly match as "ma",
# much to the surprise of the end-user (possibly a non-programmer).
#
# The catchall pattern has to come last because it captures non-space
# characters after a dollar sign not matched by a previous group.  Those
# captured characters make the error messages more informative.
#
# The substitution functions rely on the first three patterns matching
# with a non-empty string.  If that changes, then change lines like
# "if named" to "if named is not None".

del _re

def dollarsub(template, mapping=None, **kwds):
    """A function for supporting $-substitutions."""
    if mapping is None:
        mapping = kwds
    def convert(mo):
        escaped, named, braced, catchall = mo.groups()
        if named or braced:
            # Wrap the value in a 1-tuple so tuple values format as a whole.
            return '%s' % (mapping[named or braced],)
        elif escaped:
            return '$'
        lineno = template.count('\n', 0, mo.start(4)) + 1
        raise ValueError('Invalid placeholder on line %d:  %r' %
                         (lineno, catchall))
    return _pattern.sub(convert, template)

def safedollarsub(template, mapping=None, **kwds):
    """A function for $-substitutions.

    This function is 'safe' in the sense that you will never get KeyErrors if
    there are placeholders missing from the interpolation dictionary.  In that
    case, you will get the original placeholder in the value string.
    """
    if mapping is None:
        mapping = kwds
    def convert(mo):
        escaped, named, braced, catchall = mo.groups()
        if named:
            try:
                # Wrap the value in a 1-tuple so tuple values format as a whole.
                return '%s' % (mapping[named],)
            except KeyError:
                return '$' + named
        elif braced:
            try:
                # Wrap the value in a 1-tuple so tuple values format as a whole.
                return '%s' % (mapping[braced],)
            except KeyError:
                return '${' + braced + '}'
        elif escaped:
            return '$'
        lineno = template.count('\n', 0, mo.start(4)) + 1
        raise ValueError('Invalid placeholder on line %d:  %r' %
                         (lineno, catchall))
    return _pattern.sub(convert, template)

if __name__ == '__main__':
    import doctest
    print 'Doctest results: ', doctest.testmod()