using re module to find " but not " alone ... is this a BUG in re?

Paul McGuire ptmcg at austin.rr.com
Fri Jun 13 09:46:21 EDT 2008


On Jun 12, 4:11 am, anton <anto... at gmx.de> wrote:
> Hi,
>
> I want to replace all occourences of " by \" in a string.
>
> But I want to leave all occourences of \" as they are.
>
> The following should happen:
>
>   this I want " while I dont want this \"
>
> should be transformed to:
>
>   this I want \" while I dont want this \"
>
> and NOT:
>
>   this I want \" while I dont want this \\"
>

A pyparsing version is not as terse as an re, and certainly not as
fast, but it is easy enough to read.  Here is my first brute-force
approach to your problem:

    from pyparsing import Literal, replaceWith

    escQuote   = Literal(r'\"')
    unescQuote = Literal(r'"')
    unescQuote.setParseAction(replaceWith(r'\"'))

    test1 = r'this I want " while I dont want this \"'
    test2 = r'frob this " avoid this \", OK?'

    for test in (test1, test2):
        print((escQuote | unescQuote).transformString(test))

And it prints out the desired:

    this I want \" while I dont want this \"
    frob this \" avoid this \", OK?

This works by defining both patterns, escQuote and unescQuote, but
attaching a transforming parse action only to unescQuote.  Because
escQuote is among the patterns to match (I list it first), properly
escaped quotes are matched as-is and skipped over.
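
The same "match both, transform only one" idea can be sketched with the
standard re module, for anyone who does not have pyparsing handy.  This
is only an illustration, not the pyparsing code above; the names QUOTE
and escape_bare_quotes are made up for the example.

```python
import re

# Two alternatives, loosely analogous to escQuote | unescQuote:
# group 1 captures an already-escaped quote, the other branch a bare quote.
QUOTE = re.compile(r'(\\")|"')

def escape_bare_quotes(text):
    def fix(match):
        if match.group(1) is not None:
            return match.group(1)   # already escaped: emit unchanged
        return '\\"'                # bare quote: put a backslash in front
    return QUOTE.sub(fix, text)

print(escape_bare_quotes(r'this I want " while I dont want this \"'))
```

Using a callback instead of a replacement string keeps the "do nothing"
branch explicit, which is the role escQuote plays in the pyparsing
version.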

Then I looked at your problem slightly differently - why not find both
'\"' and '"', and replace either one with '\"'?  In some cases, I'm
"replacing" '\"' with '\"', but so what?  Here is the simplified
transformer:

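    from pyparsing import Optional, replaceWith

    # an optional single backslash, then a quote; note that r'\\' would
    # be two backslashes and would never match the one in \"
    quotes = Optional('\\') + '"'
    quotes.setParseAction(replaceWith(r'\"'))
    for test in (test1, test2):
        print(quotes.transformString(test))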


Again, this prints out the desired output.

Now let's retrofit this altered logic back onto John Machin's
solution:

    import re
    for test in (test1, test2):
        print(re.sub(r'\\?"', r'\"', test))


Pretty short and sweet, and pretty readable for an re.

To address Peter Otten's question about what to do with an escaped
backslash, I can't compose this with an re, but I can by adjusting the
first pyparsing version to include an escaped backslash as a "match
but don't do anything with it" expression, just like we did with
escQuote:

    from pyparsing import Literal, replaceWith

    escQuote   = Literal(r'\"')
    unescQuote = Literal(r'"')
    unescQuote.setParseAction(replaceWith(r'\"'))
    backslash = chr(92)
    escBackslash = Literal(backslash+backslash)

    test3 = r'no " one \", two \\"'
    for test in (test1, test2, test3):
        print((escBackslash | escQuote | unescQuote).transformString(test))

Prints:

    this I want \" while I dont want this \"
    frob this \" avoid this \", OK?
    no \" one \", two \\\"

At first I thought the last transform was an error, but on closer
inspection, I see that the input line ends with an escaped backslash,
followed by a lone '"', which must be replaced with '\"'.  So in the
transformed version we see '\\\"', the original escaped backslash,
followed by the replacement '\"' string.
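
For comparison, here is a rough sketch of how the same three-way
alternation might be written with the standard re module.  It assumes
that only \\ and \" count as escapes, and the names short_version,
TOKEN and escape_with_backslashes are hypothetical, chosen just for
this example.  The first function shows why an optional-backslash
pattern alone is not enough for an input like test3.

```python
import re

def short_version(text):
    # Optional backslash followed by a quote, always rewritten to \".
    return re.sub(r'\\?"', lambda m: '\\"', text)

# Alternatives are tried left to right at each position:
#   group 1: an escaped backslash or an escaped quote, kept as is
#   else:    a bare quote, which gets a backslash in front
TOKEN = re.compile(r'(\\\\|\\")|"')

def escape_with_backslashes(text):
    def fix(match):
        if match.group(1) is not None:
            return match.group(1)   # escaped pair: emit unchanged
        return '\\"'                # bare quote: escape it
    return TOKEN.sub(fix, text)

test3 = r'no " one \", two \\"'
print(short_version(test3))            # no \" one \", two \\"
print(escape_with_backslashes(test3))  # no \" one \", two \\\"
```

The short pattern treats the second backslash of the escaped pair as
the quote's escape, so the final quote is left bare; consuming the
escaped backslash as its own alternative avoids that.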

Cheers,
-- Paul


