re question

Tim Peters tim_one at email.msn.com
Sat Oct 16 18:55:45 EDT 1999


[Moshe Zadka, disliking
r"""
    %%%
    ( [^%]*
      (?: % (?! %%)
          [^%]*
      )*
    )
    %%%
"""
]

> I may be missing something obvious here, but that seems to positively
> /beg/ for a minimalistic match, something like:
>
> re.compile(r'%%%(.*?)%%%')
>
> This will result in taking everything until the *first* %%% following our
> initial %%%, which was what the poster wanted.

As does the regexp above.  The first one is better for two reasons:

1) It's more robust under modification.  That is, "the loop" in the first
regexp says "match up until the first %%%" (if any); the loop in the second
says "match some number of characters", and what those are depends entirely
on the loop's context.  The loop in the first means the same thing
regardless of context, so can be reused in other contexts without surprises.

2) It's much faster.

Since Max seemed to be asking in "copy and paste" mode, better to give him
something that's less likely to screw him tomorrow <wink>.
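
The robustness point in (1) can be seen in a small experiment. This is only a sketch: the sample text and the trailing `END` requirement are made up for illustration, and the patterns are compact spellings of the two regexps quoted above.

```python
import re

# Compact spellings of the two patterns discussed above.
UNROLLED = r"%%%([^%]*(?:%(?!%%)[^%]*)*)%%%"
LAZY = r"%%%(.*?)%%%"

text = "%%%a%%%b%%%END"  # made-up sample input

# Used as-is, both capture everything up to the first closing %%%.
unrolled_plain = re.search(UNROLLED, text).group(1)
lazy_plain = re.search(LAZY, text).group(1)

# Hypothetical modification: require a literal END right after the closing
# marker, with the match anchored at the start of the string (re.match).
unrolled_end = re.match(UNROLLED + "END", text)
lazy_end = re.match(LAZY + "END", text)

print(unrolled_plain, lazy_plain)  # a a
print(unrolled_end)                # None
print(lazy_end.group(1))           # a%%%b
```

After the change, the unrolled loop still refuses to cross a %%%, so the anchored match fails outright. The minimal-match loop quietly stretches past the first %%% to make the new context succeed, which is the kind of surprise point (1) warns about.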

[Tim]
>> Regular expressions are overkill here.  The above can be done
>> quicker and easier via string.find:

[Moshe]
> I agree. And much, MUCH better if the text to be matched grows to be more
> than a couple of KBs, at which pcre will stumble somehow, if at all.

PCRE isn't that sluggish, provided you learn how this flavor of regexps
works.  Friedl's "Mastering Regular Expressions" book is wholly applicable.
The regexp at the top is a vanilla instance of his "loop unrolling" pattern,
and won't choke on inputs of any size, whether matching or non-matching.
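
For readers who haven't met the pattern, its general shape is normal* (special normal*)*. The sketch below spells out the regexp at the top in those terms. The names `NORMAL`, `SPECIAL` and `UNROLLED` are illustrative only and appear nowhere in the original regexp; the f-string syntax assumes a modern Python.

```python
import re

# Unrolled-loop shape:  normal* (special normal*)*
# "normal" is any non-% character; "special" is a % that does not
# begin a %%% marker.
NORMAL = r"[^%]"
SPECIAL = r"%(?!%%)"

UNROLLED = re.compile(
    rf"""
    %%%                           # opening marker
    ( {NORMAL}*                   # leading run of normal characters
      (?: {SPECIAL} {NORMAL}* )*  # one special, then more normals, repeated
    )
    %%%                           # closing marker
    """,
    re.VERBOSE,
)

print(UNROLLED.search("x %%%50% off%%% y").group(1))  # 50% off
print(UNROLLED.search("%%%never closed"))             # None
```

Because a normal character can never also be a special one, the engine has only one way to carve up any stretch of text, and that is what keeps a failing match from thrashing.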

Here's a made-up 1MB string:

guts = 999 * 'x' + '%'
data = '%%%' + 1000 * guts + '%%%'

Searching this with the original regexp is slower than using string.find,
but well under a factor of two slower.  Searching it with the minimal-match
version is about 14 times slower (btw, minimal-match patterns are much
slower in Perl too).
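
To repeat the comparison, something along these lines should do. It is a sketch under assumptions: `via_find` is a hypothetical stand-in for the string.find code (which isn't reproduced in this message), written with the `str.find` method of modern Python, and the function names are invented here. Today's `re` module is a different engine from the one timed above, so don't expect the same ratios.

```python
import re
import timeit

guts = 999 * 'x' + '%'
data = '%%%' + 1000 * guts + '%%%'

UNROLLED = re.compile(r"%%%([^%]*(?:%(?!%%)[^%]*)*)%%%")
LAZY = re.compile(r"%%%(.*?)%%%")

def via_find(text, marker="%%%"):
    """Hypothetical str.find version: text between the first two markers."""
    start = text.find(marker)
    if start < 0:
        return None
    start += len(marker)
    end = text.find(marker, start)
    return None if end < 0 else text[start:end]

def via_unrolled(text):
    m = UNROLLED.search(text)
    return m.group(1) if m else None

def via_lazy(text):
    m = LAZY.search(text)
    return m.group(1) if m else None

if __name__ == "__main__":
    # Time each approach on the same 1MB string and report the average.
    for func in (via_find, via_unrolled, via_lazy):
        seconds = timeit.timeit(lambda: func(data), number=10)
        print(f"{func.__name__:>12}: {seconds / 10:.6f} s per search")
```

All three return the same substring on this input, so any difference in the timings comes from the search strategy alone.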

regexps-are-as-obscure-as-floating-point-ly y'rs  - tim





