My first Python program -- a lexer

Thomas Mlynarczyk thomas at mlynarczyk-webdesign.de
Wed Nov 12 14:18:51 EST 2008


Steve Holden wrote:

>> Suppose I use the dict and I want to access the regex associated with
>> the token named "tokenname" (that is, no iteration, but a single
>> access). I could simply write tokendict["tokenname"]. But with the list
>> of tuples, I can't think of an equally easy way to do that. But then, as
>> a beginner, I might be underestimating Python.

> But when do you want to do that? There's no point inventing use cases -
> they should be demonstrated needs.

Well, I had been thinking about further reducing the number of regex 
matches needed. So I wanted to modify my lexer not to tokenize the 
whole input at once, but to grab the next token from the input "just 
in time", on demand. For that I was thinking of a next() method like 
this:

     def next( self, nameOfExpectedToken ):
         regex = self.getRegexByTokenName( nameOfExpectedToken )
         match = regex.match( self.source, self.offset )
         if not match:
             return None  # the expected token is not at this position
         text = match.group( 0 )
         line = self.line  # the line on which the token starts
         self.line += text.count( "\n" )
         self.offset += len( text )
         return ( nameOfExpectedToken, match, line )
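
A caller would then do something like this (the token name here is 
made up, of course):

     token = lexer.next( "NUMBER" )
     if token is None:
         raise SyntaxError( "number expected at line %d" % lexer.line )
     name, match, line = token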

I'm not sure if this is a good idea, but it looks like one to me. The 
problem is the line of the method that retrieves the regex associated 
with the given token name. Using a dict, I could simply write

         regex = self.tokendict[nameOfExpectedToken]

But with a list I suppose I wouldn't get away without a loop, which I 
assume is more expensive than the dict lookup.
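
For comparison, here is roughly what I mean by both lookups (the 
token definitions are made up for illustration):

     import re

     tokens = [ ( "NUMBER", re.compile( r"\d+" ) ),
                ( "NAME",   re.compile( r"[A-Za-z_]\w*" ) ) ]
     tokendict = dict( tokens )

     # Dict: a single hashed lookup, independent of the number
     # of token types.
     regex = tokendict["NAME"]

     # List of tuples: a linear scan over the entries.
     def getRegexByTokenName( name ):
         for tokenName, tokenRegex in tokens:
             if tokenName == name:
                 return tokenRegex
         raise KeyError( name )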

> Or simply pass compiled token patterns in in the first place when they
> are necessary ... then the caller has the option of not bothering to
> optimize in the first place!

That would be an option. But shouldn't it be the lexer that takes 
care of optimizing its own work as much as it can without the 
caller's assistance? After all, the caller should not need to know 
about the internal workings of the lexer.
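
For instance, the lexer could compile everything once when it is 
created (a rough sketch; the attribute names are simply the ones I 
would pick):

     import re

     class Lexer:
         def __init__( self, tokenDefinitions, source ):
             # tokenDefinitions is a list of ( name, pattern ) pairs;
             # callers pass plain pattern strings and never see the
             # compiled objects.
             self.tokendict = dict( ( name, re.compile( pattern ) )
                                    for name, pattern in tokenDefinitions )
             self.source = source
             self.offset = 0
             self.line = 1

         def getRegexByTokenName( self, name ):
             return self.tokendict[name]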

[Optimizing performance by putting most frequent tokens first]
> With a dict you have no such opportunity, because the ordering is
> determined by the implementation and not by your data structure.

True. Still, I should be able to gain even better performance with 
the next() approach above, as it would completely eliminate all 
"useless" matching (like trying to match FOO where no FOO is 
allowed).
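
For example, a parser rule that allows only a name or a number at 
some point would try at most those two regexes (again a sketch with 
invented token names):

     def parseOperand( lexer ):
         # Only the tokens allowed here are ever matched; a FOO
         # regex is never even tried where no FOO may appear.
         token = lexer.next( "NAME" ) or lexer.next( "NUMBER" )
         if token is None:
             raise SyntaxError( "operand expected at line %d" % lexer.line )
         return token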

Greetings,
Thomas

-- 
Just because there are many of them who are wrong doesn't mean they are right!
(Coluche)


