Lexing in Python 2

Mike Fletcher mfletch at tpresence.com
Sun Jan 23 23:31:40 EST 2000


What would the requirements for a lexer be?

	Be able to write token definitions in simple, possibly regex-like
or BNF-like, notation.
	Be able to drop in more complex state-table definitions for
harder-to-define constructs.
	Fast operation on "normal" computer-and-human-language tokenisation.
	<Compatibility should be in here somewhere>
	Output as something like (tokentype, startindex, stopindex), with
optional children (see the sketch below).
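
Just to make that output shape concrete, here's a toy sketch using nothing
but the standard re module (not mxTextTools); the token names and patterns
are made up purely for illustration:

    import re

    # Toy token set -- illustrative only, not a proposed standard set.
    TOKEN_PATTERNS = [
        ('name',       r'[A-Za-z_][A-Za-z0-9_]*'),
        ('number',     r'[0-9]+'),
        ('whitespace', r'[ \t]+'),
    ]
    MASTER = re.compile(
        '|'.join(['(?P<%s>%s)' % (name, pat) for (name, pat) in TOKEN_PATTERNS]))

    def tokenize(text):
        """Return a list of (tokentype, startindex, stopindex) tuples."""
        result = []
        pos = 0
        while pos < len(text):
            match = MASTER.match(text, pos)
            if match is None:
                raise ValueError('no token matches at index %d' % pos)
            result.append((match.lastgroup, match.start(), match.end()))
            pos = match.end()
        return result

    # tokenize('spam 42') == [('name', 0, 4), ('whitespace', 4, 5), ('number', 5, 7)]

A real system would hang children off those tuples for nested definitions;
the flat form above is just the leaf case.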

I would guess you could create such a beast with simpleparse + mxTextTools
without breaking too much of a sweat.  Of course, as Tim has pointed out
before, it would be an alien sweat, so lots of ammonia and formaldehyde, but
that's always a risk.  You should be able to save the parser-definition
tables Simpleparse generates so that you only need mxTextTools at run-time
(I believe the tables are now pickle-able).  Note: mxTextTools doesn't
currently support token-start-mapping (I think this is called the fastmap
optimisation in regex land).  Using mxTextTools's splitting features might
work better for whitespace determination.
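
If the tables really are picklable, the save/load side is a one-liner each
way.  The helper and file names below are made up, and I'm assuming the
generated table is plain tuples, strings and integers:

    import pickle

    # Hypothetical helpers -- 'tagtable' stands for whatever table structure
    # the SimpleParse generator hands back; if it is plain tuples, strings
    # and integers, the standard pickle module round-trips it fine.
    def save_table(tagtable, filename):
        f = open(filename, 'wb')
        pickle.dump(tagtable, f)
        f.close()

    def load_table(filename):
        f = open(filename, 'rb')
        tagtable = pickle.load(f)
        f.close()
        return tagtable

    # Build step (needs SimpleParse installed):  save_table(table, 'grammar.pik')
    # Run-time step (mxTextTools only):          table = load_table('grammar.pik')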

Of course, there's no really useful error handling in these systems, so
you'd have to add that.  You'd likely also want a line-counting feature (should be
a trivial task, but I've never got around to it).  For real utility, you'd
want to do some hacking on mxTextTools to let it parse streams instead of
strings.
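
The line-counting part really should be trivial if you do it as a post-pass:
note where every line starts, then bisect each token's start offset into that
list.  A sketch (helper names made up):

    import bisect

    def line_starts(text):
        # Offsets at which each line begins; entry i is the start of line i+1.
        starts = [0]
        pos = text.find('\n')
        while pos != -1:
            starts.append(pos + 1)
            pos = text.find('\n', pos + 1)
        return starts

    def line_of(starts, offset):
        # 1-based line number containing the given character offset.
        return bisect.bisect_right(starts, offset)

    # starts = line_starts(source)
    # for (tokentype, start, stop) in tokens:
    #     ... line_of(starts, start) gives the line the token begins on ...

That keeps the per-character work in mxTextTools and does the line numbering
afterwards.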

Chris Tismer's RTF loader is, of course, a good place to look when designing
really fast streaming tokenisers.

Ah well, stuff I don't have time to work on.  Suppose I'll stop rambling
about it. Enjoy yourselves,
Mike

-----Original Message-----
From: Tim Peters [mailto:tim_one at email.msn.com]
Sent: Sunday, January 23, 2000 10:55 PM
To: python-list at python.org
Subject: RE: Lexing in Python 2
...
If nobody was motivated enough to write the code for Python 1, I don't know
why that would change for Python 3000 (that's what Guido insists on calling
it now <wink>).  If you want a *fast* Python lexer today, mxTextTools is
your best hope.

> It is my unconsidered, uneducated opinion that lexers do not
> vary as widely as parsers (LL(1), LR(1), LR(N) etc.) so we
> could just choose one at random and start building modules
> around it.

Curiously, mxTextTools is nothing like lex/flex.  Flex does such a good job
it's hard to get motivated to duplicate all that effort (it's not easy)
solely to get something releasable under a more Python-like license.  I
don't know how Marc-Andre would feel about folding mxTextTools into the
distribution.
...



