Help needed: cryptic perl regular expression in python syntax, Ugly solution

Paul McGuire ptmcg at austin.rr._bogus_.com
Wed Oct 20 08:39:45 EDT 2004


"Steven Bethard" <steven.bethard at gmail.com> wrote in message
news:mailman.5207.1098253340.5135.python-list at python.org...
> Could you do something like:
>
> >>> line = '   s^\\?AAA\\?01^BBB^g; #Comment '
> >>> expr = r'(^\s*)(s|tr)(.)(\\\?%s)\3(.*?)\3(.*)'
> >>> matcher = re.compile(expr % re.escape("AAA\?01"))
> >>> matcher.findall(line)
> [('   ', 's', '^', '\\?AAA\\?01', 'BBB', 'g; #Comment ')]
>
> Basically, I still use the r'' string so that I don't have to write so many
> backslashes, but then I use a %s to insert the "AAA\?01" into the middle of
> the expression.  Looks at least a little cleaner to me.
>
> Steve
>

Here's a more verbose version of Steve Bethard's suggestion.  By building
up the regexp from individual parts, it is possible to give each part some
semi-meaningful name, or to attach comments to individual pieces.  It also
makes it easier to maintain later.  What if you had to support an additional
command besides s and tr, like 'rep'?  Just change replaceCmd to read
replaceCmd = r'(s|tr|rep)'.  What if you needed to support leading tabs
in addition to leading spaces?  Change leadingWhite as needed.  For
that matter, just giving the finished regexp the name 'replaceCmdExpr'
gives the reader more of a clue as to what the regexp's purpose is,
as the original code did with extra comments.

I find nearly *all* regexps to be cryptic, and when I need them, I
usually assemble them in some fashion such as this.  David Mertz
proposes a similar style in his very good book, "Text Processing
in Python."

(Some quibble with the practice of aligning '=' signs, but I find it to be a
helpful guide to the eye when declaring a set of related strings such as
these, assuming of course that one edits using a fixed-width font.)

So why does the key get prepended with the backslashes and
question marks?

-- Paul
(I'll bet you thought I'd post a pyparsing version. :)  Well, in a
certain way, I did.)


import re

line = '   s^\\?AAA\\?01^BBB^g; #Comment '

r1 = r'(^\s*)(s|tr)(.)(\\\?\\??'
key = r"AAA\?01"  # raw string: same value, no invalid "\?" escape
r2 = r'\\??)\3(.*?)\3(.*)'
r = r1 + re.escape(key) + r2
print(re.compile(r).findall(line))

# desired regexp, from Steve Bethard's post
#  r'(^\s*)(s|tr)(.)(\\\?%s)\3(.*?)\3(.*)'

# build up regexp by parts
key           = r'AAA\?01'
leadingWhite  = r'(^\s*)'
replaceCmd    = r'(s|tr)'
sepChar       = r'(.)'
# prepend \'s and ?'s, only the OP knows why...
findString    = r'(\\\?\\??%s)' % re.escape(key)
# sepCharRef references the char read by sepChar,
# to support separators other than '^'
sepCharRef    = r'\3'
replString    = r'(.*?)'
restOfLine    = r'(.*)'
replaceCmdExpr = leadingWhite + replaceCmd + \
         sepChar + findString + sepCharRef + \
         replString + sepCharRef + restOfLine

matcher = re.compile( replaceCmdExpr )
print(matcher.findall(line))
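
As a rough sketch of the maintenance point made above (adding a 'rep'
command next to s and tr), the named parts could be wrapped in a small
function.  The function name buildReplaceCmdExpr, its commands parameter,
and the second test line are hypothetical, made up here for illustration;
the individual parts are the ones from the listing above.

```python
import re

def buildReplaceCmdExpr(key, commands=('s', 'tr')):
    # Hypothetical helper (not from the original post): assembles the
    # same named parts, but takes the list of commands as a parameter.
    leadingWhite = r'(^\s*)'
    replaceCmd   = '(%s)' % '|'.join(re.escape(c) for c in commands)
    sepChar      = r'(.)'
    # same prefix of \'s and ?'s as in the original findString
    findString   = r'(\\\?\\??%s)' % re.escape(key)
    # refers back to group 3, the separator read by sepChar
    sepCharRef   = r'\3'
    replString   = r'(.*?)'
    restOfLine   = r'(.*)'
    return (leadingWhite + replaceCmd + sepChar + findString +
            sepCharRef + replString + sepCharRef + restOfLine)

key = r'AAA\?01'
matcher = re.compile(buildReplaceCmdExpr(key, commands=('s', 'tr', 'rep')))

# first line is the one from the post; the second is an invented
# example with a leading tab and the 'rep' command
for line in ('   s^\\?AAA\\?01^BBB^g; #Comment ',
             '\trep^\\?AAA\\?01^CCC^'):
    print(matcher.findall(line))
```

Supporting another command then means passing a longer tuple, and the
group numbering (and so the \3 back-reference) stays the same.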





