re question

Sat Oct 16 06:19:16 EDT 1999

[Max M. Stalnaker]
> I have the following code:
>
>  def subset(self):
>   group=re.search(r"%%%([^%]+)%%%",self.data)
>   self.data=group.groups(0)[0]

Easier to access group.group(1) directly.
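
A minimal sketch of that point, with a made-up input string: group(1) returns the same text as groups(0)[0]. It also guards against re.search returning None when nothing matches, which the quoted method does not do. The names m and extracted are illustrative only.

```python
import re

# Hypothetical sample input; the real page data is not shown in the thread.
m = re.search(r"%%%([^%]+)%%%", "before %%%the stuff%%% after")

if m is None:
    extracted = None          # no %%%...%%% pair found
else:
    extracted = m.group(1)    # same text as m.groups(0)[0]

print(extracted)
```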

> Essentially, I get an HTML page, change some tags to %%% and extract the
> stuff between.  But the way I do it above fails if the stuff between has a
> single %.

This is much like trying to match a Python triple-quoted string with a
regexp.  Try this:

sucker = re.compile(r"""
    %%%
    (
        [^%]*
        (?: % (?! %%)  # take a % if not followed immediately by %%
            [^%]*
        )*
    )
    %%%
""", re.VERBOSE)

matches = sucker.findall(data)

That yields a list of all the guts.
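
Here is a quick check of that pattern on a made-up string, assuming a current Python 3 interpreter. It also shows the original [^%]+ pattern giving up as soon as the guts contain a lone %. The sample data and the name naive are illustrative only.

```python
import re

sucker = re.compile(r"""
    %%%
    (
        [^%]*
        (?: % (?! %%)  # take a % if not followed immediately by %%
            [^%]*
        )*
    )
    %%%
""", re.VERBOSE)

# Hypothetical sample: one chunk with a single %, one with a doubled %%.
data = "junk %%%50% off%%% more junk %%%a %% b%%% tail"
matches = sucker.findall(data)
print(matches)

# The pattern from the question finds nothing once a lone % appears.
naive = re.search(r"%%%([^%]+)%%%", "%%%50% off%%%")
print(naive)
```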

> The main goal is to extract the stuff.  Changing the tags is just
> the way I tried, and it sometimes succeeded.
>
> Maybe there is a better way to do this.

Regular expressions are overkill here.  The above can be done quicker and
easier via string.find:

import string

i = 0
matches = []
while 1:
    i = string.find(data, "%%%", i)   # find opening %%%
    if i < 0:
        break
    j = string.find(data, "%%%", i+3) # find next %%%
    if j < 0:
        break
    matches.append(data[i+3:j])       # take the stuff between them
    i = j+3                           # resume where this one ended
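
The string module's find function is gone in Python 3, where find is a method on the string itself. As a sketch under that assumption, the same loop can be wrapped in a function. The name between_markers and its marker parameter are hypothetical, not something from the original exchange.

```python
def between_markers(data, marker="%%%"):
    """Return the text between each successive pair of markers."""
    matches = []
    n = len(marker)
    i = 0
    while True:
        i = data.find(marker, i)        # find opening marker
        if i < 0:
            break
        j = data.find(marker, i + n)    # find the next marker after it
        if j < 0:
            break                       # unpaired opener: stop
        matches.append(data[i + n:j])   # take the stuff between them
        i = j + n                       # resume where this one ended
    return matches

result = between_markers("junk %%%50% off%%% more %%%a %% b%%% tail")
print(result)
```

A trailing marker with no partner is simply ignored, as in the loop above.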

> ...
> My current idea is to construct a single character sentinel out
> of something greater than chr(128) and use that.  This will probably
> work in this application, but I feel like I am missing something.

Regexps are good at lexical classification but suck for parsing -- wrong
tool for the job.  Rule of thumb:  as soon as you write a regexp that fails
to work correctly, drop the idea -- you'll end up never using regexps at all
<wink>.  Really, trick-based parsing rarely works correctly, unless you're a
bot and have enough spare cycles to try all possible tricks.

If you're going to be in the business of parsing HTML, take some time to
learn how to use the htmllib module.  It works -- and when it doesn't
<wink>, you can just file a bug report.
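
htmllib was removed from the standard library in Python 3, where html.parser fills that role. What follows is only a sketch of the parser-based approach. It assumes the stuff of interest lives inside <pre> elements, which is a guess, since the question never says which tags were being swapped for %%%. The class name TagTextCollector is hypothetical too.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text found inside every <tag>...</tag> element."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while inside the target element
        self.chunks = []    # text pieces of the element being read
        self.matches = []   # finished text, one entry per element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.matches.append("".join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

# Hypothetical page; "pre" is an assumed tag of interest.
page = "<html><body><pre>50% off</pre><p>skip</p><pre>a %% b</pre></body></html>"
collector = TagTextCollector("pre")
collector.feed(page)
collector.close()
print(collector.matches)
```

No sentinel characters are needed this way, so a stray % in the page cannot confuse the extraction.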

regexps-are-to-html-as-regexps-are-to-raising-children-ly y'rs  - tim