Memory consumption TPG (Toy Parser Generator for Python) & re expressions

Sun May 25 16:13:36 EDT 2003

Hello again Dominic ;-)

On Sat, 24 May 2003 15:16:09 +0200, Dominic wrote:

> 
> I've written a small parser for
> mailbox parsing.
> My first version used a non-context-sensitive
> lexer.
> Tokenizing an entire mailbox file
> caused Python to thrash.

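A Token object is created for each token, so with the non-context-sensitive
lexer the whole input is turned into Token objects before parsing starts.
TPG is not designed to handle huge data.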

> 
> My second attempt, with a CS lexer,
> has been more successful. I've been
> able to parse 43MB altogether (several files) in
> 13min, which is not sooo fast.
> (Memory consumption by Python between 4-14MB is still ok)

The CSL lexer consumes less memory since Token objects are created while
parsing. But when TPG backtracks, the input may be scanned several times,
which costs time.
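
To illustrate the memory difference, here is a minimal sketch of eager versus
on-demand tokenizing with the re module. This is not TPG's code: TOKEN_RE,
tokenize_eager and tokenize_lazy are hypothetical names made up for the
example, and the token grammar is a toy one.

```python
import re

# Toy token grammar (an assumption for this sketch, not TPG's):
# words and single punctuation characters; whitespace is skipped.
TOKEN_RE = re.compile(r"(?P<word>\w+)|(?P<punct>[^\w\s])")

def tokenize_eager(text):
    """Build the complete token list up front; memory grows with the input."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def tokenize_lazy(text):
    """Yield one token at a time; only the current token is kept alive."""
    for m in TOKEN_RE.finditer(text):
        yield m.lastgroup, m.group()
```

A consumer that reads tokenize_lazy() one token at a time never holds the
whole token list, but it also cannot go back without scanning again, which is
the trade-off described above.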

A solution could be to split mailboxes into single messages and then
parse each message.
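
A minimal sketch of that splitting step, assuming a plain mbox file where each
message starts with a "From " line and body lines beginning with "From " have
already been escaped (usually as ">From "). split_mbox is a hypothetical
helper, not part of TPG.

```python
def split_mbox(lines):
    """Yield each message of an mbox stream as one string.

    `lines` is any iterable of text lines, e.g. an open file, so only one
    message is held in memory at a time.
    """
    current = []
    for line in lines:
        # A "From " separator line opens a new message.
        if line.startswith("From ") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)
```

Each yielded string can then be handed to the parser on its own, so the
parser only ever sees one message.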

A better solution is the rfc822 module.
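
Note that rfc822 was later deprecated and is no longer available in Python 3,
where the standard email package covers the same ground. A minimal sketch of
header parsing with it follows; the sample message and the read_headers name
are made up for the example.

```python
import email

def read_headers(raw_message):
    """Parse one RFC 822 style message and return (sender, subject, body)."""
    msg = email.message_from_string(raw_message)
    return msg["From"], msg["Subject"], msg.get_payload()

# Made-up sample message for illustration only.
SAMPLE = (
    "From: dominic@example.org\n"
    "Subject: TPG memory consumption\n"
    "\n"
    "Hello\n"
)
sender, subject, body = read_headers(SAMPLE)
```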

> I also encountered a bug in TPG.
> When a CS lexer/parser is generated,
> the class inherits from ToyParser1
> instead of ToyParserCSL or so.

This bug was introduced in the last release :-(
I've just fixed it (in the latest beta release, i.e. 2.1.6-dev).
Thanks.

> 
> So any suggestions for improvements
> are welcome :-)
> 
> 
> Ciao,
>    Dominic

-- 

(o_   Christophe Delord                   _o)
//\   http://christophe.delord.free.fr/   /\\
V_/_  mailto:christophe.delord at free.fr   _\_V