difficult regular expression

Chermside, Michael mchermside at ingdirect.com
Wed Oct 30 10:50:53 EST 2002


> It's easy to grab the section containing the text "Cats ... foods." (not
> including the Chickens section.)  However, I need to get just the items:
> mice, rats, rabbits, marmots.


Not all regular expression parsers are equivalent.

Actually, that's not quite true... mathematically, most of them ARE
equivalent in important ways (and some aren't), but for practical
purposes, RE engines fall into two categories: those that support
basic grep features (give or take a few) and those that support
Perl 5 features (give or take a few). Python supports the Perl 5
features (again, give or take a few).

Two of those features are "look-ahead" and "look-behind" assertions.
These are zero-width assertions that make the RE match (or not
match) based on whether some (subsidiary) RE matches the string
starting/ending at that point... but the text matched by the subsidiary
RE does NOT become part of the match of the main RE. If you find this
description confusing, check out the docs:
   http://python.org/doc/current/lib/re-syntax.html
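
To make "zero-width" concrete, here is a minimal sketch. The sample
string and the variable names are made up for illustration and are not
from the original question. It shows that the text tested by the
assertion is required to be there but is left out of the match:

```python
import re

# Hypothetical sample string, used only for illustration.
sample = "price: 42 dollars, weight: 17 pounds"

# Look-behind: match digits only if "price: " comes right before them.
after_price = re.search(r'(?<=price: )\d+', sample)

# Look-ahead: match digits only if " pounds" comes right after them.
before_pounds = re.search(r'\d+(?= pounds)', sample)

print(after_price.group(0))    # 42
print(after_price.span())      # (7, 9), so "price: " is not in the match
print(before_pounds.group(0))  # 17
```

Note that Python's re module requires a look-behind pattern to match
strings of a fixed length; a literal prefix like the one above is fine.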

So I'm going to try to use these to create an RE that solves your
problem. (The following is a transcript of my Python session, but
with the typos and mistakes taken out.)

>>> import re
>>> text = """
... Here is a list of foods and consumers: Dogs eat <chicken>, <rice>,
... <steak>, and other foods. Cats eat <mice>, <rats>, <rabbits>, <marmots>,
... and other foods. Chickens eat <grain>, <corn>, <wheat>, and other
... foods.  Wow, that's a lot of eating!
... """
>>> # The text is really all one long line
>>> text = text.replace('\n',' ')
>>> re_1 = re.compile(r'Cats eat ((<.*?>, )+)and other foods\.')
>>> matchObj = re_1.search(text)
>>> matchObj.groups()[0]
'<mice>, <rats>, <rabbits>, <marmots>, '
>>> # Okay, that worked. In fact, it might be enough to
>>> # solve the whole problem. But we'll try it with the
>>> # look-ahead/behind assertions anyway.
>>> re_2 = re.compile(r'(?<=Cats eat )((<.*?>, )+)(?=and other foods\.)')
>>> matchObj = re_2.search(text)
>>> # now let's see what the entire pattern matched
>>> matchObj.group(0)
'<mice>, <rats>, <rabbits>, <marmots>, '
>>> # Yep... it works.
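
Both patterns still return the items wrapped in angle brackets and
separated by commas. If you need the bare names (mice, rats, rabbits,
marmots), one possible approach is a second pass over the matched
section. This is only a sketch: the names section_re and item_re are
mine, and it assumes every item is written as <name> with no nested
angle brackets.

```python
import re

# The sample text from the session above, already joined into one line.
text = ("Here is a list of foods and consumers: Dogs eat <chicken>, <rice>, "
        "<steak>, and other foods. Cats eat <mice>, <rats>, <rabbits>, <marmots>, "
        "and other foods. Chickens eat <grain>, <corn>, <wheat>, and other "
        "foods.  Wow, that's a lot of eating!")

# First pass: isolate the list that belongs to the Cats section.
section_re = re.compile(r'(?<=Cats eat )((<.*?>, )+)(?=and other foods\.)')

# Second pass: capture whatever sits between each pair of angle brackets.
item_re = re.compile(r'<(.*?)>')

match = section_re.search(text)
items = item_re.findall(match.group(0)) if match else []
print(items)  # ['mice', 'rats', 'rabbits', 'marmots']
```

Doing it in two steps is deliberate: a repeated group such as
(<.*?>, )+ only remembers its last repetition, so a single search
cannot hand back all four items as separate groups.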

Does that help?

-- Michael Chermside



