SimpleParse performance compared to regex engine

Mike C. Fletcher mcfletch at rogers.com
Sat May 24 20:25:42 EDT 2003


Strange as it may seem, parsers aren't generally very good at lexing 
tasks (it's why most parsers use a lexer as a front-end).  Regexen, on 
the other hand, really are quite good at lexing.  Although there are 
ways you could speed up the SimpleParse grammar, I would be surprised if 
you could make it significantly faster than the regex engine for this 
particular type of task. mxTextTools simply doesn't have the low-level 
lexing optimizations that would allow it.

To explain somewhat more:

    SimpleParse does not have a particular (common) optimization based
    on the analysis of the left most possible matching string to create
    a "fast map" from a particular character/character-set to possible
    matching sub-productions.  The RE engine does have that
    optimization, and since your entire grammar, in essence, is a simple
    OR of token-types, (with each token a single item in the fast-map
    table), the regex engine can determine which production(s) can match
    with a single check against the current character at the top-level
    production of the regular expression.

    In contrast, with the way you have written the grammar, to determine
    that, for instance, a space character is "plain", the mxTextTools
    engine parses all of:

wikiname/email/table_open/table_close/rule/heading/emph/italic
  

    checking each production until it can prove that the production does
    not match.  The fast-match optimization would store for the "markup
    production" a set of possible leftmost strings for all
    sub-productions, so that the entire check for the "OR"'d items could
    be reduced to a check against that table, and then checking only the
    registered productions.

    You could likely speed up the grammar somewhat, but honestly, I
    don't see the parser becoming noticeably faster than the lexer.  As
    a note, though, with the CVS version of SimpleParse, which has had
    considerable optimizations in the last week, the difference between
    the two approaches is approximately 2:1 for the regex engine (though
    the timings vary from 2:1 to .9:1 depending on the phase of the moon
    and memory effects on my machine).

    Parsers (and particularly SimpleParse) become useful when you have
    highly structured data, where the low-level lexing operations are
    swamped by the organizational complexity of the data and what you
    need to do is rapidly extract the *structure* of the data.

Good luck with your project,
Mike


Austine Jane wrote:

> Dear Mike,
>
> I've been experimenting with your wonderful module SimpleParse (thanks).
>
> I've come up with a few questions concerning the performance of my 
> code using SimpleParse. It was slower than regex version. Could you 
> have a look at it, please?
>
> I posted it on c.l.p a minute ago.
>
> Thanks in advance.
>
> Jane

_______________________________________
  Mike C. Fletcher
  Designer, VR Plumber, Coder
  http://members.rogers.com/mcfletch/








More information about the Python-list mailing list