Python and regexp efficiency.. again.. :)

Sun Dec 12 08:58:31 EST 1999

What percentage of the lines do you expect to actually match?
What percentage of the lines match the commonstring but none of the tails?
Would it help to search for just the tails first, and then weed out
false matches by checking for the commonstring?
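
To make that last question concrete, here is a rough sketch of what I
have in mind. The date pattern, the tail strings, and the function name
are all made up for illustration (I am assuming a syslog-like date with
the service name right after it); I have not timed this against your
data, so treat it as something to measure rather than a known win.

```python
import re

# Hypothetical stand-in for the "commonstring": a syslog-style date prefix.
COMMON = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ")

# Hypothetical unique tails (service + message), as plain strings.
TAILS = ("sshd: connection refused", "kernel: out of memory")

def match_tail_first(line):
    """Return the tail found directly after the date prefix, else None."""
    for tail in TAILS:
        # Cheap substring test first; most lines should fail here.
        if tail not in line:
            continue
        # Only now pay for the regexp, and verify the tail's position.
        m = COMMON.match(line)
        if m is not None and line.startswith(tail, m.end()):
            return tail
    return None
```

A line such as "Dec 12 08:58:31 cron: saw sshd: connection refused"
contains a tail but not right after the date, so it is rejected by the
second check. Whether this beats the combined regexp depends on the
answers to the first two questions.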

Yishai

Markus Stenberg wrote:

> Patrick Tufts <zippy at cs.brandeis.edu> writes:
> > In article <al8n1rj6tgv.fsf at sirppi.helsinki.fi>, Markus Stenberg
> > <mstenber at cc.Helsinki.FI> wrote:
> > > An order-of-magnitude speedup was gained by writing
> > > a specialized regexp optimization tool, as the regexps are mostly
> > > of the form
> > >                 ^commonstringuniquetail
> > >                 ^commonstringanothertail
> > Depending on how many different extensions there are to commonstring,
> > you might do better with the regexp:
>
> Regrettably, there are N different extensions.
>
> >    ^commonstring(.*)
> >
> > and then matching the saved pattern (.*) against a dictionary of
> > possible extensions.
>
> Basically, the common start is usually a date, and the non-common parts
> are, depending on log type, for example service name and message string
> (syslog). Generally, the service+message combination is the interesting
> part, but to prevent false matches, their location on the line must be
> verified to be just after the date at the beginning of the line.
>
> Admittedly, I _think_ it might be somewhat faster (but not much) to do
> the date-part checking in C and then just use regexps to parse the tail,
> but I doubt I could gain an order of magnitude in speed from that.
>
> > --Pat
>
> -Markus
>
> --
> "He who fights with monsters should look to it that he himself does
> not become a monster. And when you gaze long into an abyss the abyss
> also gazes into you."
>                     - Friedrich Nietzsche, _Beyond Good and Evil_