Memory consumption TPG (Toy Parser Generator for Python) & re expressions
Christophe Delord
christophe.delord at free.fr
Sun May 25 16:13:36 EDT 2003
Hello again Dominic ;-)
On Sat, 24 May 2003 15:16:09 +0200, Dominic wrote:
>
> I've written a small parser for
> mailbox parsing:
> My first version used a non-context-
> sensitive lexer.
> Tokenizing an entire mailbox file
> caused python to thrash.
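With the non-context-sensitive lexer, a Token object is created for each
token of the input before parsing starts, so memory grows with the size
of the file. TPG is not designed to handle huge inputs.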
>
> My second attempt with a cs-lexer
> has been more successful: I've been
> able to parse 43MB altogether (several files) in
> 13min, which is not sooo fast.
> (Memory consumption by python between 4-14MB is still ok)
The CSL lexer consumes less memory since Token objects are created while
parsing. But when TPG backtracks, the input may be scanned several times,
which costs time.
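To illustrate the memory difference only (this is not TPG's code; the
TOKEN pattern and both function names are made up for the example), here
is a rough sketch with the re module. The first function builds every
token up front, the second hands them out one at a time:

```python
import re

# Hypothetical token pattern for the sketch: words and single punctuation marks.
TOKEN = re.compile(r"(?P<word>\w+)|(?P<punct>[^\w\s])")

def tokenize_all(text):
    """Eager: one tuple per token, all held in memory at once."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

def tokenize_lazy(text):
    """Lazy: tokens are produced only when the consumer asks for them."""
    for m in TOKEN.finditer(text):
        yield m.lastgroup, m.group()

text = "Subject: hello"
eager = tokenize_all(text)          # whole list exists before any parsing
first = next(tokenize_lazy(text))   # only the first token has been built
```

The lazy version keeps memory flat, but a consumer that needs to go back
has to scan the text again, which is the trade-off described above.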
A solution could be to split mailboxes into single messages and then
parse each message.
A better solution is the standard rfc822 module, which already parses
message headers.
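A minimal sketch of the splitting idea, assuming an mbox-style file in
which each message starts with a "From " line and body lines that start
with "From " are escaped (as ">From "). The iter_messages helper and the
sample data are invented for the example, and it uses the standard email
package instead of rfc822:

```python
import io
from email.parser import Parser

def iter_messages(lines):
    """Yield one raw message (as a string) at a time from mbox-style lines."""
    current = []
    for line in lines:
        # A line starting with "From " marks the beginning of a new message.
        if line.startswith("From ") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)

# Made-up sample data; a real mailbox would be an open file object.
sample = (
    "From alice@example.org Sat May 24 15:16:09 2003\n"
    "Subject: first\n"
    "\n"
    "body one\n"
    "From bob@example.org Sun May 25 16:13:36 2003\n"
    "Subject: second\n"
    "\n"
    "body two\n"
)

subjects = []
for raw in iter_messages(io.StringIO(sample)):
    # Drop the "From " separator line, then parse the headers only.
    _, _, rest = raw.partition("\n")
    msg = Parser().parsestr(rest, headersonly=True)
    subjects.append(msg["Subject"])
```

Only one message is held in memory at a time, so each message could also
be handed to a TPG parser separately.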
> I also encountered a bug in TPG.
> When a CS-Lexer/Parser is generated
> the class inherits from ToyParser1
> instead of ToyParserCSL or so.
This bug was introduced in the last release :-(
I've just fixed it (in the latest beta release, i.e. 2.1.6-dev).
Thanks.
>
> So any suggestions for improvements
> are welcome :-)
>
>
> Ciao,
> Dominic
>
>
>
>
--
(o_ Christophe Delord _o)
//\ http://christophe.delord.free.fr/ /\\
V_/_ mailto:christophe.delord at free.fr _\_V