Parsing strings (\n and \\)

Fredrik Lundh fredrik at pythonware.com
Wed Jun 26 06:29:44 EDT 2002


Thomas Guettler wrote:

> Look at the two functions quote and unquote. I wrote them
> without regular expressions because I think it is faster.

faster to write, perhaps.

and faster to run, if you only use them on strings with no
more than 2-3 characters.

but if you use a different set of test strings with more ordinary
characters than escaped characters, e.g.

     strings = ['foo', '', '\\', ' ', '"', '\\"', '\\\\']
     strings = [(x+"spamspamspamspamspam")*10 for x in strings]

you'll find that a RE approach can be much faster.  the following
version is about four times faster than your code, under 2.2:

import re

def re_quote(string, sub=re.compile(r"[\\\"]").sub):
    # prefix every backslash and double quote with a backslash
    def fixup(m):
        return "\\" + m.group(0)
    return sub(fixup, string)

def re_unquote(string, sub=re.compile(r"(?s)\\(.)|\\").sub):
    def fixup(m):
        ch = m.group(1)
        if ch is None:
            # second alternative matched: a lone backslash at the end
            raise ValueError('Parse Error: Backslash at end of string')
        if ch not in r"\\\"":
            raise ValueError('Parse Error: unsupported character after backslash')
        return ch
    return sub(fixup, string)
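
a quick way to convince yourself that the two functions are inverses is a round trip over the test strings from above.  the sketch below restates both functions so it runs on its own; raising ValueError and the sample inputs are my assumptions, not part of the original code:

```python
import re

def re_quote(string, sub=re.compile(r"[\\\"]").sub):
    # put a backslash in front of every backslash and double quote
    return sub(lambda m: "\\" + m.group(0), string)

def re_unquote(string, sub=re.compile(r"(?s)\\(.)|\\").sub):
    def fixup(m):
        ch = m.group(1)
        if ch is None:
            # lone backslash at the end of the string
            raise ValueError("backslash at end of string")
        if ch not in '\\"':
            raise ValueError("unsupported character after backslash")
        return ch
    return sub(fixup, string)

strings = ['foo', '', '\\', ' ', '"', '\\"', '\\\\']
strings = [(x + "spamspamspamspamspam") * 10 for x in strings]

# quoting and then unquoting should give back the original string
for s in strings:
    assert re_unquote(re_quote(s)) == s

print(re_quote('say "hi"'))

# malformed input is rejected instead of being passed through
try:
    re_unquote("abc\\")
except ValueError as exc:
    print("rejected:", exc)
```

the round trip only shows the functions agree with each other; it says nothing about speed.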

:::

note the use of callbacks instead of substitution templates.  it's
usually faster (and in my opinion, also more pythonic) to use e.g.

    def fixup(m):
        return "spam %s %s" % m.group(1, 2)
    re.sub(pattern, fixup, string)

or, if you prefer lambdas:

    re.sub(pattern, lambda m: "spam %s %s" % m.group(1, 2), string)

than the re.sub non-standard interpolation syntax:

    re.sub(pattern, "spam \\1 \\2", string)

(and where possible, it's also slightly faster to use m.groups() instead
of enumerating all the groups in m.group(...))
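
to see that these forms really are interchangeable, and to time them on your own setup, something like the following sketch may help.  the pattern, the sample text and the function names are made up for illustration, and the timings depend on python version and input, so no particular result is claimed:

```python
import re
import timeit

pattern = re.compile(r"(\w+)=(\w+)")   # hypothetical pattern with two groups
text = "a=1 b=2 c=3"                   # hypothetical input

def fixup(m):
    return "spam %s %s" % m.group(1, 2)

def fixup_groups(m):
    # same result here, because the pattern has exactly two groups
    return "spam %s %s" % m.groups()

template = "spam \\1 \\2"

via_callback = pattern.sub(fixup, text)
via_groups = pattern.sub(fixup_groups, text)
via_lambda = pattern.sub(lambda m: "spam %s %s" % m.group(1, 2), text)
via_template = pattern.sub(template, text)

# all four spellings produce the same string
assert via_callback == via_groups == via_lambda == via_template
print(via_callback)

# rough timing; compare the numbers on your own machine
for name, repl in [("callback", fixup), ("groups", fixup_groups), ("template", template)]:
    seconds = timeit.timeit(lambda: pattern.sub(repl, text), number=2000)
    print(name, seconds)
```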

ymmv, as usual.

</F>

<!-- (the eff-bot guide to) the python standard library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->




