Speeding up multiple regex matches

Talin viridia at gmail.com
Sat Nov 19 15:31:42 EST 2005


OK, that worked really well. In particular, the "lastindex" property of
the match object can be used to tell exactly which group matched,
without having to sequentially search the list of groups.
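For illustration, here's a tiny sketch of that (the patterns here are
just placeholders, not the ones reflex actually uses):

    import re

    # One capturing group per token pattern, joined into a single regex.
    combined = re.compile( r"(\s+)|(\d+)|([a-zA-Z_]\w*)" )

    m = combined.match( "foo123" )
    # lastindex is the 1-based number of the group that matched, so it
    # identifies the winning pattern directly -- here group 3, the identifier.
    assert m.lastindex == 3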

In fact, I was able to use your idea to cobble together a poor man's
lexer which I am calling "reflex" (Regular Expressions For Lexing).
Here's an example of how it's used:

    # Define the states using an enumeration
    State = Enum( 'Default', 'Comment', 'String' )

    # Create a scanner
    scanner = reflex.scanner( State.Default )
    scanner.rule( "\s+" )
    scanner.rule( "/\*", reflex.set_state( State.Comment ) )
    scanner.rule( "[a-zA-Z_][\w_]*", KeywordOrIdent )
    scanner.rule( "0x[\da-fA-F]+|\d+", token=TokenType.Integer )
    scanner.rule( "(?:\d+\.\d*|\.\d+)(?:[eE]?[+-]?\d+)|\d+[eE]?[+-]?\d+",
                  token=TokenType.Real )

    # Multi-line comment state
    scanner.state( State.Comment )
    scanner.rule( "\*/", reflex.set_state( State.Default ) )
    scanner.rule( "(?:[^*]|\*(?!/))+" )

    # Now, create an instance of the scanner
    token_stream = scanner( input_file_iter )
    for token in token_stream:
        print token

Internally, it creates an array of patterns and actions for each state.
Then, when you ask it to create a scanner instance, it combines all of
the patterns into one large regular expression. Input lines are matched
against this regex, and if the match succeeds, the match object's
'lastindex' property is used to look up the action to perform in the
array.
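
Roughly, the dispatch works like this (a simplified sketch, not the
actual reflex internals; the rule patterns and actions here are made up):

    import re

    # Each rule is a (pattern, action) pair; a None action means
    # "match and discard" (e.g. whitespace).
    rules = [
        ( r"\s+",               None ),
        ( r"0x[\da-fA-F]+|\d+", lambda text: ("INT", text) ),
        ( r"[a-zA-Z_]\w*",      lambda text: ("IDENT", text) ),
    ]

    # Wrap each pattern in its own capturing group and join with '|'.
    combined = re.compile( "|".join( "(%s)" % pat for pat, _ in rules ) )

    def scan( line ):
        pos = 0
        while pos < len( line ):
            m = combined.match( line, pos )
            if not m:
                raise SyntaxError( "unexpected character at %d" % pos )
            # lastindex is 1-based, so it indexes straight into the
            # rule array to find the action for the pattern that matched.
            action = rules[m.lastindex - 1][1]
            if action is not None:
                yield action( m.group() )
            pos = m.end()

    # list( scan( "foo 0x1F bar" ) )
    #   -> [('IDENT', 'foo'), ('INT', '0x1F'), ('IDENT', 'bar')]

One caveat with this scheme: the lastindex-to-rule mapping only works if
the individual patterns contain no capturing groups of their own
(non-capturing (?:...) groups are fine).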



