[Python-Dev] A standard lexer?

Tim Peters tim_one@email.msn.com
Sun, 2 Jul 2000 13:21:03 -0400


[Paul Prescod]
> As an aside: I would be pumped about getting a generic lexer into the
> Python distribution.

[Fredrik Lundh]
> how about this quick and dirty proposal:
>
> - add a new primitive to SRE: (?P#n), where n is a small integer.
>   this primitive sets the match object's "index" variable to n when
>   the engine stumbles upon it.

Note that the lack of "something like this" is one of the real barriers to
speeding SPARK's lexing, and the speed of a SPARK lexer now (well, last I
looked into this) can be wildly dependent on the order in which you define
your lexing methods (partly because there's no way to figure out which
lexing method matched without iterating through all the groups to find the
first that isn't None).
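
Concretely, that "irritating iteration" looks something like this (a
rough illustration, not SPARK's actual code; the pattern and names are
invented):

    import re

    # One combined alternation; exactly one alternative can match.
    pattern = re.compile(r"(\d+)|([a-zA-Z_]\w*)|(\s+)")
    m = pattern.match("guido123")

    # Scan every group to find the first that isn't None, which costs
    # a pass over all the alternatives for every single token.
    which = None
    for i, g in enumerate(m.groups()):
        if g is not None:
            which = i
            break
    print(which, m.group())    # prints: 1 guido123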

The same kind of irritating iteration is needed in IDLE and pyclbr too
(disguised as unrolled if/elif/elif/... chains), and in tokenize.py
(where it's *really* disguised, in a convoluted way: more string tests
are done on the matched substring to *infer* which of the regexp
pattern chunks must have matched).
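
A simplified sketch of that tokenize.py trick (the real tests are more
elaborate; this pattern and classifier are invented for illustration):

    import re
    import string

    pattern = re.compile(r"\d+|[a-zA-Z_]\w*|\s+")

    def classify(m):
        # No groups are consulted at all: re-test the matched text to
        # infer which chunk of the alternation must have matched.
        initial = m.group()[0]
        if initial in string.digits:
            return "number"
        elif initial in string.ascii_letters or initial == "_":
            return "name"
        else:
            return "whitespace"

    m = pattern.match("   42")
    print(classify(m))    # prints: whitespace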

OTOH, arbitrary small integers are not Pythonic.  Your example *generates*
them in order to guarantee they're unique, which is a bad sign (it implies
users can't do this safely by hand, and I believe that's the truth of it
too):

>         for phrase, action in lexicon:
>             p.append("(?:%s)(?P#%d)" % (phrase, len(p)))
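
For concreteness, here's roughly the lexer loop those generated numbers
would feed.  Since (?P#n) doesn't exist, this sketch substitutes the
match object's lastindex attribute (the number of the matched group),
which carries the same information so long as the phrases contain no
groups of their own; the lexicon entries are invented:

    import re

    lexicon = [
        (r"\d+",          "number"),
        (r"[a-zA-Z_]\w*", "name"),
        (r"\s+",          "skip"),
    ]

    # Group i+1 corresponds to lexicon[i].
    pattern = re.compile("|".join("(%s)" % phrase
                                  for phrase, action in lexicon))

    def tokenize(s):
        pos = 0
        while pos < len(s):
            m = pattern.match(s, pos)
            if m is None:
                raise SyntaxError("no token at position %d" % pos)
            # lastindex plays the role of the proposed "index".
            yield lexicon[m.lastindex - 1][1], m.group()
            pos = m.end()

    print(list(tokenize("spam 42")))
    # prints: [('name', 'spam'), ('skip', ' '), ('number', '42')]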

How about instead enhancing the existing (?P<name>pattern) notation, to
set a new match object attribute to name if & when pattern matches?
Then arbitrary info associated with a named pattern can be gotten at
via a dict keyed by pattern name, & the whole mess should be more
readable.
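
Here's a sketch of the lexer that suggestion enables, using the match
object's lastgroup attribute (the name of the matched group) as the
stand-in for the new attribute, plus a dict of actions keyed by pattern
name; the lexicon and action functions are invented:

    import re

    def do_number(text):  return ("NUMBER", int(text))
    def do_name(text):    return ("NAME", text)
    def do_skip(text):    return None

    lexicon = [
        ("number", r"\d+",          do_number),
        ("name",   r"[a-zA-Z_]\w*", do_name),
        ("skip",   r"\s+",          do_skip),
    ]

    pattern = re.compile("|".join("(?P<%s>%s)" % (name, phrase)
                                  for name, phrase, action in lexicon))
    actions = {name: action for name, phrase, action in lexicon}

    def tokenize(s):
        pos = 0
        while pos < len(s):
            m = pattern.match(s, pos)
            if m is None:
                raise SyntaxError("no token at position %d" % pos)
            token = actions[m.lastgroup](m.group())  # dispatch by name
            if token is not None:
                yield token
            pos = m.end()

    print(list(tokenize("x 42")))
    # prints: [('NAME', 'x'), ('NUMBER', 42)]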

On the third hand, I'm really loath to add more gimmicks to stinking
regexps.  But, on the fourth hand, no alternative yet has proven popular
enough to move away from those suckers.

if-you-can't-get-a-new-car-at-least-tune-up-the-old-one-ly y'rs  - tim