Speeding up multiple regex matches
Talin
viridia at gmail.com
Sat Nov 19 15:31:42 EST 2005
OK that worked really well. In particular, the "lastindex" property of
the match object can be used to tell exactly which group matched,
without having to sequentially search the list of groups.
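As a rough illustration of the "lastindex" trick, here is a minimal sketch. The patterns, the names list, and the classify() helper are all made up for this example and are not part of reflex. It assumes each alternative is wrapped in exactly one capturing group and contains no capturing groups of its own, so that lastindex maps directly onto the rule's position in the list.

```python
import re

# Hypothetical rules: one pattern per token kind, in priority order.
patterns = [r"\s+", r"[a-zA-Z_]\w*", r"\d+"]
names = ["whitespace", "identifier", "integer"]

# Wrap every pattern in its own capturing group and join with "|".
combined = re.compile("|".join("(%s)" % p for p in patterns))

def classify(text, pos=0):
    """Return (kind, matched_text) for the rule matching at pos, or None."""
    m = combined.match(text, pos)
    if m is None:
        return None
    # lastindex is the 1-based number of the group that matched,
    # so no loop over m.groups() is needed to find the rule.
    return names[m.lastindex - 1], m.group()
```

With these rules, classify("foo 42") picks the identifier rule and classify("42") picks the integer rule, each with a single match call.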
In fact, I was able to use your idea to cobble together a poor man's
lexer which I am calling "reflex" (Regular Expressions For Lexing).
Here's an example of how it's used:
# Define the states using an enumeration
State = Enum( 'Default', 'Comment', 'String' )
# Create a scanner
scanner = reflex.scanner( State.Default )
scanner.rule( r"\s+" )
scanner.rule( r"/\*", reflex.set_state( State.Comment ) )
scanner.rule( r"[a-zA-Z_][\w_]*", KeywordOrIdent )
scanner.rule( r"0x[\da-fA-F]+|\d+", token=TokenType.Integer )
scanner.rule(
    r"(?:\d+\.\d*|\.\d+)(?:[eE]?[+-]?\d+)|\d+[eE]?[+-]?\d+",
    token=TokenType.Real )
# Multi-line comment state
scanner.state( State.Comment )
scanner.rule( r"\*/", reflex.set_state( State.Default ) )
scanner.rule( r"(?:[^*]|\*(?!/))+" )
# Now, create an instance of the scanner
token_stream = scanner( input_file_iter )
for token in token_stream:
    print token
Internally, it creates an array of patterns and actions for each state.
Then when you ask it to create a scanner instance, it combines all of
the patterns into a large regular expression. Input lines are matched
against this regex, and if a match succeeds, then the match object's
'lastindex' property is used to look up the actions to perform in the
array.
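The mechanism described above could be sketched roughly as follows. This is not the reflex source. MiniScanner, its compile() and scan() methods, and the ("state", ...) / ("token", ...) action tuples are hypothetical stand-ins chosen to keep the example short. It also assumes that rule patterns contain no capturing groups of their own.

```python
import re

class MiniScanner:
    """Hypothetical stand-in for the scanner described above (not reflex)."""

    def __init__(self, start_state):
        self.start_state = start_state
        self.current = start_state   # state that rule() currently adds to
        self.rules = {}              # state -> list of (pattern, action)
        self.compiled = {}           # state -> (combined regex, actions)

    def state(self, name):
        self.current = name

    def rule(self, pattern, action=None):
        self.rules.setdefault(self.current, []).append((pattern, action))

    def compile(self):
        # One big alternation per state, one capturing group per rule.
        for state, rules in self.rules.items():
            combined = "|".join("(%s)" % p for p, _ in rules)
            self.compiled[state] = (re.compile(combined),
                                    [a for _, a in rules])

    def scan(self, lines):
        state = self.start_state
        for line in lines:
            pos = 0
            while pos < len(line):
                regex, actions = self.compiled[state]
                m = regex.match(line, pos)
                if m is None or m.end() == pos:
                    raise ValueError("no rule matches at %r" % line[pos:])
                # lastindex (1-based) selects the action for the matched rule.
                action = actions[m.lastindex - 1]
                pos = m.end()
                if action is None:
                    continue          # e.g. whitespace or comment text
                kind, value = action
                if kind == "state":
                    state = value     # switch to another rule set
                else:
                    yield value, m.group()

scanner = MiniScanner("default")
scanner.rule(r"\s+")
scanner.rule(r"/\*", ("state", "comment"))
scanner.rule(r"[a-zA-Z_]\w*", ("token", "ident"))
scanner.rule(r"\d+", ("token", "int"))
scanner.state("comment")
scanner.rule(r"\*/", ("state", "default"))
scanner.rule(r"(?:[^*]|\*(?!/))+")
scanner.compile()

tokens = list(scanner.scan(["x /* note", "still */ 42"]))
```

Here the comment spans two input lines, so the scanner stays in the "comment" state across the line boundary and only the identifier and the integer come out as tokens.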