Finding messages in huge mboxes

Donn Cave donn at u.washington.edu
Mon Feb 2 16:43:30 EST 2004


In article <401eb54c$0$315$e4fe514c at news.xs4all.nl>,
 Bastiaan Welmers <haasje at welmers.net> wrote:
...
> I need to find messages in huge mbox files (50MB or more).
...
> Especially because I often need messages at the end
> of the MBOX file.
> So I tried the following (scanning messages backwards
> on found "From " lines with readline())

readline() is not your friend here.  I suggest that
you read large blocks of data, 8192 bytes for example,
and search each block iteratively for message
boundaries, like this:

    next = block.find('\nFrom ', prev + 1)

Repeating this gives you the location of each message
in the current block, so you can split the block up
into a list of messages.  (Since you are working
backwards, there will be an extra chunk of data at the
beginning of each block, before the first "From ".
Recycle that onto the end of the next block you read,
which is the one just before it in the file.)
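
The splitting step described above can be sketched as follows.
This is only an illustration, not the code used for the timing
in this post: the function name split_block is hypothetical, the
block is assumed to be a bytes object (Python 3), and body lines
that begin with "From " are assumed to be escaped (">From ") as
mbox writers normally do.

```python
def split_block(block):
    """Split a block into (leading fragment, list of message chunks).

    Hypothetical helper: the leading fragment is everything before the
    first b"\\nFrom " boundary, and belongs to a message that started
    in an earlier part of the file.
    """
    sep = b"\nFrom "
    starts = []
    pos = block.find(sep)
    while pos != -1:
        # Skip the newline so that each chunk begins with b"From ".
        starts.append(pos + 1)
        pos = block.find(sep, pos + 1)
    if not starts:
        # No boundary at all: the whole block is a fragment.
        return block, []
    leading = block[:starts[0]]
    ends = starts[1:] + [len(block)]
    messages = [block[s:e] for s, e in zip(starts, ends)]
    return leading, messages
```

Note that a "From " at the very start of a block is not matched,
because its newline sits in the preceding block; it ends up in the
leading fragment and is found once that fragment is joined onto
the earlier block.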

Since file object buffering is at best useless in this
application, I would use posix.open, posix.lseek and
posix.read (the same functions are available as os.open,
os.lseek and os.read).  Taking this approach, I find that
reading the last 10 messages in a 100 MB folder takes
0.05 sec.
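
For illustration, here is a minimal sketch of the whole approach,
reading blocks from the end of the file with os.open, os.lseek and
os.read.  It is not the code that produced the timing above.  The
names last_messages and _read_exact are hypothetical, a POSIX
system and Python 3 are assumed, and the file is assumed to be a
well-formed mbox whose body lines never start with "From ".

```python
import os


def _read_exact(fd, size):
    """Read exactly `size` bytes from fd, or fewer only at end of file."""
    chunks = []
    while size > 0:
        chunk = os.read(fd, size)
        if not chunk:
            break
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def last_messages(path, count, block_size=8192):
    """Return up to `count` trailing messages of an mbox file, oldest first."""
    sep = b"\nFrom "
    fd = os.open(path, os.O_RDONLY)
    try:
        pos = os.lseek(fd, 0, os.SEEK_END)
        carry = b""    # fragment left over from the block read before this one
        messages = []  # collected newest first
        while pos > 0 and len(messages) < count:
            size = min(block_size, pos)
            pos -= size
            os.lseek(fd, pos, os.SEEK_SET)
            # Recycle the leftover fragment onto the end of the earlier block.
            data = _read_exact(fd, size) + carry
            end = len(data)
            while len(messages) < count:
                idx = data.rfind(sep, 0, end)
                if idx == -1:
                    break
                messages.append(data[idx + 1:end])
                end = idx + 1
            carry = data[:end]
        # The first message in the file has no newline before its "From ".
        if pos == 0 and len(messages) < count and carry.startswith(b"From "):
            messages.append(carry)
        messages.reverse()
        return messages
    finally:
        os.close(fd)
```

Something like last_messages("folder.mbox", 10) would then return
the final ten messages as bytes objects.  The sketch rescans the
carried fragment on every pass, which is wasteful when a single
message spans many blocks; a more careful version would remember
how far it has already searched.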

   Donn Cave, donn at u.washington.edu


