Awkward code using multiple regexp searches

Fri Sep 10 03:01:07 EDT 2004

Topher Cawlfield wrote:
> But a few times already I've found myself
> writing the following bit of awkward code when parsing text files.  Can 
> anyone suggest a more elegant solution?
> 
> rexp1 = re.compile(r'blah(dee)blah')
> rexp2 = re.compile(r'hum(dum)')
> for line in inFile:
>     reslt = rexp1.search(line)
>     if reslt:
>         something = reslt.group(1)
>     else:
>         reslt = rexp2.search(line)
>         if reslt:
>             somethingElse = reslt.group(1)

I usually solve this particular case with a 'continue':

  for line in inFile:
      reslt = rexp1.search(line)
      if reslt:
          something = reslt.group(1)
          continue
      reslt = rexp2.search(line)
      if reslt:
          somethingElse = reslt.group(1)
          continue

Still more cumbersome than the Perl equivalent.

You could do a trick like this

import re

class Match:
   def __init__(self, pattern, flags=0):
     self.pat = re.compile(pattern, flags)
     self.m = None
   def __call__(self, s):
     self.m = self.pat.match(s)
     return bool(self.m)
   def __nonzero__(self):
     return bool(self.m)
   def group(self, x):
     return self.m.group(x)
   def start(self, x):
     return self.m.start(x)
   def end(self, x):
     return self.m.end(x)

pat1 = Match("A(.*)")
pat2 = Match("BA(.*)")
pat3 = Match("BB(.*)")

def test(s):
   if pat1(s): print "Looks like", pat1.group(1)
   elif pat2(s): print "no, it is", pat2.group(1)
   elif pat3(s): print "really?", pat3.group(1)
   else: print "Never mind."

 >>> test("ABCDE")
Looks like BCDE
 >>> test("BACDE")
no, it is CDE
 >>> test("BBCDE")
really? CDE
 >>> test("CBBDE")
Never mind.
 >>>
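The class above is written for the Python of its day (print statement, __nonzero__). As a rough sketch only, the same trick in Python 3 syntax might look like the following, where truth testing goes through __bool__ instead. The function name describe is my own stand-in for test, and it returns the text rather than printing it so the result is easy to check.

```python
import re

class Match:
    def __init__(self, pattern, flags=0):
        self.pat = re.compile(pattern, flags)
        self.m = None
    def __call__(self, s):
        # remember the latest result so group() can get at it later
        self.m = self.pat.match(s)
        return bool(self.m)
    def __bool__(self):
        # Python 3 spelling of __nonzero__
        return bool(self.m)
    def group(self, x):
        return self.m.group(x)

pat1 = Match("A(.*)")
pat2 = Match("BA(.*)")

def describe(s):
    # hypothetical stand-in for test(), returning instead of printing
    if pat1(s):
        return "Looks like " + pat1.group(1)
    elif pat2(s):
        return "no, it is " + pat2.group(1)
    else:
        return "Never mind."
```

Calling describe("ABCDE") should give "Looks like BCDE", matching the session above.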

This is much more along the lines of what you want,
but it conflates the idea of search object and
match object and makes your code more susceptible
to subtle breaks.  Consider

digits = Match(r"(\s*(\d+)\s*)")

def divisor(s):
   if s[:1] == "/":
     if digits(s[1:]):
       return int(digits.group(2))
     raise TypeError("nothing after the /")
   # no fraction, use 1 as the divisor
   return 1

def fraction(s):
   if digits(s):
     denom = divisor(s[digits.end(1):])
     return int(digits.group(2)), denom
   raise TypeError("does not start with a number")

 >>> fraction("4/5")
(5, 5)
 >>>
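The numerator comes out wrong because of shared state: divisor() calls digits() again, which overwrites the match that fraction() was still relying on. A minimal sketch of one way around it, in Python 3 syntax, is to keep each match object in a local variable. The name DIGITS and the switch to ValueError are my own choices for illustration, not part of the code above.

```python
import re

# Same pattern as the 'digits' example, used as a plain compiled regex.
DIGITS = re.compile(r"(\s*(\d+)\s*)")

def divisor(s):
    if s[:1] == "/":
        m = DIGITS.match(s[1:])   # local match object, nothing shared
        if m:
            return int(m.group(2))
        raise ValueError("nothing after the /")
    # no fraction, use 1 as the divisor
    return 1

def fraction(s):
    m = DIGITS.match(s)
    if m:
        denom = divisor(s[m.end(1):])
        # m still refers to the first match, whatever divisor() did
        return int(m.group(2)), denom
    raise ValueError("does not start with a number")
```

With this version fraction("4/5") should return (4, 5), since the second match no longer clobbers the first.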

But as a Perl programmer you are perhaps used to this
because Perl makes the same conflation and so has
the same problems.  (I think.  It's been a while ...
Nope!  The regexp search results appear to be "my"
variables now.  When I started with perl4 all variables
were either global or "dynamically scoped"-ish with
local.)

> I'm a little bit worried about doing the following in Python, since I'm 
> not sure if the compiler is smart enough to avoid doing each regexp 
> search twice:
> 
> for line in inFile:
>     if rexp1.search(line):
>         something = rexp1.search(line).group(1)
>     elif rexp2.search(line):
>         somethingElse = rexp2.search(line).group(1)
> 
> In many cases I am worried about efficiency as these scripts parse a 
> couple GB of text!

It isn't smart enough.  To make it that smart would require
a lot more work.  For example, how does it know that the
implementation of "rexp1.search(line)" always returns the
same value?  Or even that "rexp1.search" returns the
same bound method?
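To see the repeated work concretely, here is a small sketch (Python 3 syntax) that wraps a compiled pattern and counts calls to search. CountingPattern is a hypothetical helper made up for this demonstration, not anything in the re module, and the sample line is invented.

```python
import re

class CountingPattern:
    """Hypothetical wrapper that counts how often search() runs."""
    def __init__(self, pattern):
        self.pat = re.compile(pattern)
        self.calls = 0
    def search(self, s):
        self.calls += 1
        return self.pat.search(s)

rexp1 = CountingPattern(r"blah(dee)blah")

line = "xx blahdeeblah yy"   # made-up input line
if rexp1.search(line):
    something = rexp1.search(line).group(1)

# The search ran once for the test and once more for group().
print(rexp1.calls, something)   # prints: 2 dee
```

Because search here is an ordinary method with a side effect (the counter), skipping the second call would change what the program does, which is the kind of thing the compiler cannot rule out.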

				Andrew
				dalke at dalkescientific.com