My first Python program -- a lexer

Mon Nov 10 16:02:53 EST 2008

On Nov 11, 12:26 am, Thomas Mlynarczyk <tho... at mlynarczyk-webdesign.de> wrote:
> John Machin wrote:
>
> >> On the other hand: If all my tokens are "mutually exclusive" then,
> > But they won't *always* be mutually exclusive (another example is
> > relational operators (< vs <=, > vs >=)) and AFAICT there is nothing
> > useful that the lexer can do with an assumption/guess/input that they
> > are mutually exclusive or not.
>
> "<" vs. "<=" can be handled with lookaheads (?=...) / (?!...) in regular
> expressions.

Single-character tokens like "<" may be more efficiently handled by
doing a dict lookup after failing to find a match in the list of
(name, regex) tuples.
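
A minimal sketch of that fallback, assuming Python 3. The token names
(NUMBER, NAME, LE, GE, LT, GT, ASSIGN, PLUS) and the helpers next_token
and tokenize are made up for illustration and are not from the original
lexer. The ordered (name, regex) list is tried first; only when nothing
matches is the current character looked up in a dict.

```python
import re

# Hypothetical token tables, for illustration only.
TOKEN_REGEXES = [
    ("NUMBER", re.compile(r"\d+")),
    ("NAME", re.compile(r"[A-Za-z_]\w*")),
    ("LE", re.compile(r"<=")),
    ("GE", re.compile(r">=")),
]
SINGLE_CHAR_TOKENS = {"<": "LT", ">": "GT", "=": "ASSIGN", "+": "PLUS"}


def next_token(text, pos):
    """Return (name, lexeme, new_pos) for the token starting at pos."""
    # First try the ordered list of (name, regex) tuples.
    for name, regex in TOKEN_REGEXES:
        match = regex.match(text, pos)
        if match:
            return name, match.group(), match.end()
    # No regex matched: fall back to a single dict lookup.
    char = text[pos]
    name = SINGLE_CHAR_TOKENS.get(char)
    if name is None:
        raise ValueError("unexpected character %r at position %d" % (char, pos))
    return name, char, pos + 1


def tokenize(text):
    """Split text into (name, lexeme) pairs, skipping whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        name, lexeme, pos = next_token(text, pos)
        tokens.append((name, lexeme))
    return tokens
```

Here the dict really is used as a mapping (character to token name),
which is the kind of use a dict is meant for.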

> True, the lexer cannot do anything useful with the
> assumption that all tokens are mutually exclusive. But if they are,
> there will be no ambiguity and I am guaranteed to always get the same
> sequence of tokens from the same input string.

So what? That is useless knowledge. It is the ambiguous cases that you
need to be concerned with.

>
> > Your Lexer class should promise to check the regexes in the order
> > given. Then the users of your lexer can arrange the order to suit
> > themselves.
>
> Yes. So there's no way around a list of tuples instead of dict().

Correct.
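
To illustrate why the order has to be under the caller's control, here
is a small sketch, assuming Python 3. The helper first_match and the
token names LT and LE are hypothetical. It compares three arrangements
of the same two operators: longest first, shortest first, and a version
using the negative lookahead suggested earlier in the thread.

```python
import re


def first_match(tokens, text, pos=0):
    """Return (name, lexeme) for the first regex in `tokens` that matches."""
    for name, regex in tokens:
        match = re.compile(regex).match(text, pos)
        if match:
            return name, match.group()
    return None


# Longer operator listed first: order alone resolves the overlap.
ORDERED = [("LE", r"<="), ("LT", r"<")]
# Shorter operator listed first: LT wins and "<=" is split wrongly.
MISORDERED = [("LT", r"<"), ("LE", r"<=")]
# A negative lookahead makes the two patterns mutually exclusive,
# so the result no longer depends on the order.
LOOKAHEAD = [("LT", r"<(?!=)"), ("LE", r"<=")]

print(first_match(ORDERED, "<= 1"))
print(first_match(MISORDERED, "<= 1"))
print(first_match(LOOKAHEAD, "<= 1"))
```

With a dict in Python 2 the iteration order is arbitrary, so neither
the ORDERED nor the MISORDERED behaviour could be relied on.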

>
> > Your code uses dict methods; this forces your callers to *create* a
> > mapping. However (as I said) your code doesn't *use* that mapping --
> > there is no RHS usage of dict[key] or dict.get(key) etc. In fact I'm
> > having difficulty imagining what possible practical use there could be
> > for a mapping from token-name to regex.
>
> Sorry, but I still don't quite get it.
>
> for name, regex in self.tokens.iteritems():
>      # ...
>      self.result.append( ( name, match, self.line ) )
>
> What I do here is take a name and its associated regex and then store a
> tuple (name, match, line). In a simpler version of the lexer, I might
> store only `name` instead of the tuple. Is your point that the lexer
> doesn't care what `name` actually is, but simply passes it through from
> the tokenlist to the result?

No, not at all. The point is that you were not *using* any of the
mapping functionality of the dict object, only ancillary methods like
iteritems -- hence, you should not have been using a dict at all.
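
As a rough sketch of what that means in practice (assuming Python 3;
the class layout, the scan method, and the convention that a name of
None marks text to skip are all my own assumptions, not the original
code), the same loop works on a plain list of (name, regex) pairs. It
stores the matched text rather than the match object, again only for
illustration.

```python
import re


class Lexer:
    """Minimal sketch: `tokens` is an ordered sequence of (name, regex) pairs."""

    def __init__(self, tokens):
        # Compile once and keep the caller's order. No dict is needed,
        # because nothing below ever looks a regex up by its name.
        self.tokens = [(name, re.compile(regex)) for name, regex in tokens]

    def scan(self, source):
        """Return a list of (name, lexeme, line_number) tuples."""
        result = []
        for line_number, line in enumerate(source.splitlines(), start=1):
            pos = 0
            while pos < len(line):
                for name, regex in self.tokens:
                    match = regex.match(line, pos)
                    # Ignore empty matches so the scan always advances.
                    if match and match.end() > pos:
                        if name is not None:  # hypothetical "skip" marker
                            result.append((name, match.group(), line_number))
                        pos = match.end()
                        break
                else:
                    raise ValueError(
                        "no token matches at line %d, column %d"
                        % (line_number, pos + 1)
                    )
        return result


lexer = Lexer([(None, r"\s+"), ("NUMBER", r"\d+"), ("NAME", r"[a-z]+")])
print(lexer.scan("ab 12\ncd"))
```

The only thing the loop asks of its input is iteration in a known
order, which a list of tuples provides.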