Help with Regular Expressions
Raymond Hettinger
othello at javanet.com
Mon Mar 12 17:48:46 EST 2001
Is there an idiom for how to use regular expressions for lexing?
My attempt below is unsatisfactory because it has to filter the
entire match group dictionary to find out which token caused
the match. This approach isn't scalable because every token
match requires a loop over all possible token types.
I've fiddled with this one for hours and can't seem to find a
direct way to get a group dictionary that contains only matches.
Many Thanks,
Raymond
import re

def tokenize(pattern, string):
    'Create an items() style list of token types and token values'
    pos = 0
    ans = []
    while 1:
        m = pattern.search(string, pos)
        if m is None:
            break
        pos = m.end()
        # the following gobbledygook finds the first matched group name and value
        ans.append(next((k, v) for k, v in m.groupdict().items() if v is not None))
    return ans

pgm = ' (a1->b & b->c) imp (a1->c)'
lexpat = re.compile(
    r'(?P<op>(->)|(imp)|(&))|(?P<lpar>\()|(?P<rpar>\))|(?P<id>[A-Za-z]\w*)')
print(tokenize(lexpat, pgm))
# result is:
[('lpar', '('), ('id', 'a1'), ('op', '->'), ('id', 'b'), ('op', '&'),
 ('id', 'b'), ('op', '->'), ('id', 'c'), ('rpar', ')'), ('op', 'imp'),
 ('lpar', '('), ('id', 'a1'), ('op', '->'), ('id', 'c'), ('rpar', ')')]
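One possible way to sidestep the dictionary scan is the match object's
`lastgroup` attribute, which reports the name of the last capturing group
that matched. A minimal sketch of that idea follows; note that the inner
alternatives of the `op` group are left non-capturing here, since an unnamed
inner group that matched would otherwise make `lastgroup` return None:

```python
import re

def tokenize2(pattern, string):
    'Variant of tokenize() that reads the matched group name directly'
    ans = []
    for m in pattern.finditer(string):
        # lastgroup names the (single) named group that matched,
        # so no loop over the groupdict is needed
        ans.append((m.lastgroup, m.group()))
    return ans

# same token classes as above, but with no nested capturing groups
lexpat = re.compile(
    r'(?P<op>->|imp|&)|(?P<lpar>\()|(?P<rpar>\))|(?P<id>[A-Za-z]\w*)')
print(tokenize2(lexpat, ' (a1->b & b->c) imp (a1->c)'))
```

This keeps each token lookup constant-time regardless of how many token
types the pattern defines.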