Memory consumption: TPG (Toy Parser Generator for Python) & re expressions

Dominic I-C-H at gmx.de
Sat May 24 09:16:09 EDT 2003


I've written a small parser for mailbox parsing.
My first version used a non-context-sensitive
lexer; tokenizing an entire mailbox file
caused Python to thrash.
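(Not from the original post, just a hypothetical sketch of one way around the problem: instead of tokenizing the whole file up front, the mailbox can be split into messages lazily and each message fed to the parser on its own. The "From - " delimiter assumed here is the Netscape/Mozilla mbox separator mentioned further down.)

```python
import io


def iter_messages(fileobj):
    # Hypothetical sketch: yield one mailbox message at a time, so the
    # whole file never has to be held (or tokenized) in memory at once.
    # Messages are assumed to start with a "From - " line.
    lines = []
    for line in fileobj:
        if line.startswith("From - ") and lines:
            yield "".join(lines)
            lines = []
        lines.append(line)
    if lines:
        yield "".join(lines)


# Usage on a small in-memory mailbox with two messages:
mbox = io.StringIO("From - a\nbody a\n\nFrom - b\nbody b\n")
msgs = list(iter_messages(mbox))
```

Each yielded message is small, so the parser's memory use is bounded by the largest single message rather than the whole mailbox.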

My second attempt, with a context-sensitive
lexer, has been more successful: I've been
able to parse 43 MB altogether (several files)
in 13 minutes, which is not so fast.
(Memory consumption by Python stays between
4 and 14 MB, which is still OK.)

One problem has been matching a string until,
e.g., "\r\n\r\nFrom -" appears. My re
expressions caused Python to exit with a
recursion depth exception. (I think I used
something like "(?s).*?\r\n\r\n(?=From - )".)
Now I've broken that into lexical rules,
which consumes more memory and CPU cycles. :-(
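(Again not from the original post: a regex-free alternative that sidesteps the recursion limit is to locate the delimiter with plain string search. `split_messages` below is a hypothetical helper, assuming the "\r\n\r\nFrom - " separator described above.)

```python
def split_messages(data):
    # Hypothetical sketch: split an mbox buffer at "\r\n\r\nFrom - "
    # boundaries using str.find instead of a backtracking regex, so no
    # recursion depth is involved regardless of input size.
    sep = "\r\n\r\nFrom - "
    start = 0
    parts = []
    while True:
        i = data.find(sep, start)
        if i == -1:
            parts.append(data[start:])
            return parts
        # Keep the terminating blank line ("\r\n\r\n") with the current
        # message; the next part starts at its "From - " line.
        parts.append(data[start:i + 4])
        start = i + 4


# Usage on a two-message buffer:
sample = "From - msg1\r\nbody1\r\n\r\nFrom - msg2\r\nbody2"
parts = split_messages(sample)
```

str.find scans linearly, so this stays fast and flat in memory even on multi-megabyte files.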

I also liked the non-context-sensitive lexer
more, since it made my grammar and tokens
simpler. But I see no easy way to stop TPG
from tokenizing the entire mailbox up front.

I also encountered a bug in TPG: when a
CS lexer/parser is generated, the class
inherits from ToyParser1 instead of
ToyParserCSL (or so).

So any suggestions for improvements
are welcome :-)


Ciao,
   Dominic



-------------- next part --------------
A non-text attachment was scrubbed...
Name: test1.py
Type: text/x-python
Size: 2489 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20030524/de9e6dbb/attachment.py>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test2.py
Type: text/x-python
Size: 2848 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20030524/de9e6dbb/attachment-0001.py>


More information about the Python-list mailing list