My first Python program -- a lexer

John Machin sjmachin at lexicon.net
Sat Nov 8 17:53:57 EST 2008


On Nov 9, 7:55 am, Thomas Mlynarczyk <tho... at mlynarczyk-webdesign.de>
wrote:
> Hello,
>
> I started to write a lexer in Python -- my first attempt to do something
> useful with Python (rather than trying out snippets from tutorials). It
> is not complete yet, but I would like some feedback -- I'm a Python
> newbie and it seems that, with Python, there is always a simpler and
> better way to do it than you think.
>
> ### Begin ###
>
> import re
>
> class Lexer(object):

So far, so good.

>      def __init__( self, source, tokens ):

Be consistent with your punctuation style. I'd suggest *not* having a
space after ( or before ), i.e. stick with the style of your
class Lexer(object): line above. Read
http://www.python.org/dev/peps/pep-0008/

>          self.source = re.sub( r"\r?\n|\r\n", "\n", source )

Firstly, wouldn't you expect to be getting your text from a text file
(perhaps even one opened with the universal-newlines option), i.e. by
the time it arrives here, source has already had \r\n changed to \n?
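
e.g., in Python 2:

   # universal-newlines mode: \r\n and bare \r both arrive as \n
   source = open(filename, "rU").read()

(filename here is just a stand-in for wherever your text lives.)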

Secondly, \r?\n already matches both \n and \r\n, so that pattern is
equivalent to
   re.sub(r"\n|\r\n|\r\n", "\n", source)
with a redundant alternative. What's wrong with
   re.sub(r"\r\n", "\n", source)
?
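
Note that neither pattern converts a lone \r (old Mac-style line
ending); the universal-newlines option above does. If you want to
handle all three endings in one pass, something like

   source = re.sub(r"\r\n?", "\n", source)   # \r\n or bare \r -> \n

does the job.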

Thirdly, if source does contain \r\n, the normalisation shifts all
later offsets, so the offset reported on an error won't match the
original file. Consider retaining the offset of the last newline seen,
so that your error reporting can give the line number and the column
position within the line.
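
One way to do that (untested sketch; line_start is a name I've
invented for the offset just past the most recent newline,
initialised to 0):

   text = match.group(0)
   pos = text.rfind("\n")
   if pos >= 0:
       self.line_start = self.offset + pos + 1
   self.line += text.count("\n")
   self.offset += len(text)

   # and in the error branch:
   column = self.offset - self.line_start + 1
   raise Exception("Syntax error at line %d, column %d"
                   % (self.line, column))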

>          self.tokens = tokens
>          self.offset = 0
>          self.result = []
>          self.line   = 1
>          self._compile()
>          self._tokenize()
>
>      def _compile( self ):
>          for name, regex in self.tokens.iteritems():
>              self.tokens[name] = re.compile( regex, re.M )
>
>      def _tokenize( self ):

Unless you have other plans for it, offset could be local to this
method.

>          while self.offset < len( self.source ):

You may like to avoid getting len(self.source) for each token.
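
e.g. bind it once before the loop:

   source_len = len(self.source)   # hoisted out of the loop
   while self.offset < source_len:
       ...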

>              for name, regex in self.tokens.iteritems():

dict.iter<anything>() will return its results in an essentially
arbitrary order. That doesn't matter with your example, but you will
rapidly come across real-world cases where the order does matter. One
such case is distinguishing between real constants (1.23, .123, 123.)
and integer constants (123).
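
If you need a definite order, keep the tokens in a list of
(name, pattern) pairs rather than a dict -- untested sketch with
made-up token definitions:

   token_spec = [
       # try "float" before "int", else "1.23" lexes as
       # int 1 followed by float .23
       ("float", r"\d+\.\d*|\.\d+"),
       ("int",   r"\d+"),
   ]
   compiled = [(name, re.compile(pattern, re.M))
               for name, pattern in token_spec]

and iterate over compiled instead of self.tokens.iteritems().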

>                  match = regex.match( self.source, self.offset )
>                  if not match: continue
>                  self.offset += len( match.group(0) )
>                  self.result.append( ( name, match, self.line ) )
>                  self.line += match.group(0).count( "\n" )
>                  break
>              else:
>                  raise Exception(
>                      'Syntax error in source at offset %s' %
>                      str( self.offset ) )

Using str() here and below is redundant ... "%s" % obj is documented
to produce str(obj).
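
i.e. just

   raise Exception('Syntax error in source at offset %s' % self.offset)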

>
>      def __str__( self ):
>          return "\n".join(
>              [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %

For avoidance of ambiguity, you may like to change that '%s' to %r --
see the example after the quoted code.

>                ( str( line ), str( match.pos ), name, match.group(0) )
>                for name, match, line in self.result ] )
>
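
For example, with a token whose text spans a newline:

   >>> print "[%s]" % "a\nb"
   [a
   b]
   >>> print "[%r]" % "a\nb"
   ['a\nb']
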
HTH,
John


