Help with Regular Expressions

Raymond Hettinger othello at javanet.com
Mon Mar 12 17:48:46 EST 2001


Is there an idiom for using regular expressions for lexing?

My attempt below is unsatisfactory because it has to filter the
entire match group dictionary to find out which token caused
the match. This approach doesn't scale: every token match
requires a loop over all possible token types.

I've fiddled with this one for hours and can't seem to find a
direct way to get a group dictionary that contains only the matches.

Many Thanks,

Raymond

import re

def tokenize(pattern, string):
    'Create an items() style list of token types and token values'
    pos = 0
    ans = []
    while True:
        m = pattern.search(string, pos)
        if m is None:
            break
        pos = m.end()
        # the following gobbledygook scans every group to find the
        # one matched name/value pair
        ans.append(next((k, v) for k, v in m.groupdict().items() if v))
    return ans

pgm = ' (a1->b & b->c) imp (a1->c)'
lexpat = re.compile(
    r'(?P<op>(->)|(imp)|(&))|(?P<lpar>\()|(?P<rpar>\))|(?P<id>[A-Za-z]\w*)')
print(tokenize(lexpat, pgm))

# result is:
# [('lpar', '('), ('id', 'a1'), ('op', '->'), ('id', 'b'), ('op', '&'),
#  ('id', 'b'), ('op', '->'), ('id', 'c'), ('rpar', ')'), ('op', 'imp'),
#  ('lpar', '('), ('id', 'a1'), ('op', '->'), ('id', 'c'), ('rpar', ')')]
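
For comparison, a minimal sketch of the direct lookup being asked for,
a drop-in variant of tokenize that reads the token type straight off the
match object's lastgroup attribute, which names the group that matched.
The inner alternations are written without capturing parentheses so that
lastgroup always reports the named group itself:

import re

def tokenize(pattern, string):
    'Create an items() style list of token types and token values'
    pos = 0
    ans = []
    while True:
        m = pattern.search(string, pos)
        if m is None:
            break
        pos = m.end()
        # lastgroup names the matched group directly; no filtering needed
        ans.append((m.lastgroup, m.group()))
    return ans

# no capturing parentheses inside the named groups, so lastgroup
# always refers to the named group that matched
lexpat = re.compile(r'(?P<op>->|imp|&)|(?P<lpar>\()|(?P<rpar>\))|(?P<id>[A-Za-z]\w*)')
print(tokenize(lexpat, ' (a1->b & b->c) imp (a1->c)'))

This yields the same token list as above, and the per-match cost no
longer grows with the number of token types.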



