SimpleParse performance compared to regex engine
Mike C. Fletcher
mcfletch at rogers.com
Sat May 24 20:25:42 EDT 2003
Strange as it may seem, parsers aren't generally very good at lexing
tasks (it's why most parsers use a lexer as a front-end). Regexen, on
the other hand, really are quite good at lexing. Although there are
ways you could speed up the SimpleParse grammar, I would be surprised if
you could make it significantly faster than the regex engine for this
particular type of task. mxTextTools simply doesn't have the low-level
lexing optimizations that would allow it.
To explain somewhat more:
SimpleParse does not have a particular (common) optimization based
on the analysis of the left most possible matching string to create
a "fast map" from a particular character/character-set to possible
matching sub-productions. The RE engine does have that
optimization, and since your entire grammar, in essence, is a simple
OR of token-types, (with each token a single item in the fast-map
table), the regex engine can determine which production(s) can match
with a single check against the current character at the top-level
production of the regular expression.
In contrast, with the way you have written the grammar, to determine
that, for instance, a space character is "plain", the mxTextTools
engine parses all of:
wikiname/email/table_open/table_close/rule/heading/emph/italic
checking each production until it can prove that the production does
not match. The fast-match optimization would store for the "markup
production" a set of possible leftmost strings for all
sub-productions, so that the entire check for the "OR"'d items could
be reduced to a check against that table, and then checking only the
registered productions.
You could likely speed up the grammar somewhat, but honestly, I
don't see the parser becoming noticeably faster than the lexer. As
a note, though, with the CVS version of SimpleParse, which has had
considerable optimizations in the last week, the difference between
the two approaches is approximately 2:1 for the regex engine (though
the timings vary from 2:1 to .9:1 depending on the phase of the moon
and memory effects on my machine).
Parsers (and particularly SimpleParse) become useful when you have
highly structured data, where the low-level lexing operations are
swamped by the organizational complexity of the data and what you
need to do is rapidly extract the *structure* of the data.
Good luck with your project,
Mike
Austine Jane wrote:
> Dear Mike,
>
> I've been experimenting with your wonderful module SimpleParse (thanks).
>
> I've come up with a few questions concerning the performance of my
> code using SimpleParse. It was slower than regex version. Could you
> have a look at it, please?
>
> I posted it on c.l.p a minute ago.
>
> Thanks in advance.
>
> Jane
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/
More information about the Python-list
mailing list