Finding messages in huge mboxes

David M. Cooke cookedm+news at physics.mcmaster.ca
Mon Feb 2 21:03:34 EST 2004


At some point, Donn Cave <donn at u.washington.edu> wrote:

> In article <401eb54c$0$315$e4fe514c at news.xs4all.nl>,
>  Bastiaan Welmers <haasje at welmers.net> wrote:
> ...
>> I need to find messages in huge mbox files (50MB or more).
> ...
>> Especially because I often need messages at the end
>> of the MBOX file.
>> So I tried the following (scanning messages backwards
>> on found "From " lines with readline())
>
> readline() is not your friend here.  I suggest that
> you read large blocks of data, like 8192 bytes for
> example, and search them iteratively.  Like,
> next = block.find('\nFrom ', prev + 1)

Unless, of course, you read '\nFr', then 'om ' in the next block...
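
To make that failure concrete, here is a small sketch (Python 3, working
on bytes). The sample data and the split point are made up for
illustration; the only thing taken from the thread is the '\nFrom '
search.

```python
# Hypothetical sample: one body line followed by an mbox "From " separator.
data = b"body line\nFrom alice@example.com Mon Feb  2 21:03:34 2004\n"
marker = b"\nFrom "

# Cut the data so the marker straddles the block boundary: '\nFr' | 'om '
cut = data.find(marker) + 3
blocks = [data[:cut], data[cut:]]

# Searching each block on its own misses the separator entirely...
hits_per_block = [block.find(marker) for block in blocks]

# ...while searching the unsplit data finds it.
hit_whole = data.find(marker)
```

Neither block contains the full marker, so a per-block find() reports
nothing even though the separator is there.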

I can't think of a simple way around this (except for reading by
lines). Concatenating the last two blocks means having to keep track of
what you've already seen in the previous block. Maybe pick off the
incomplete last line of each block (using block.rfind('\n')) and
prepend it to the beginning of the next block.
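
A minimal sketch of that carry-over idea, assuming Python 3, a file
opened in binary mode, and that any line starting with 'From ' counts as
a separator (body lines are assumed to be quoted as '>From '). The name
find_from_offsets and the sample data are hypothetical, not something
from this thread, and this scans forwards rather than backwards.

```python
import io

def find_from_offsets(f, block_size=8192):
    """Yield the byte offset of every line in f that starts with b'From '."""
    marker = b"From "
    tail = b""        # incomplete last line carried over from earlier blocks
    tail_offset = 0   # file offset at which `tail` starts (always a line start)
    while True:
        block = f.read(block_size)
        if not block:
            break
        data = tail + block
        cut = data.rfind(b"\n") + 1      # end of the last complete line (0 if none)
        complete = data[:cut]
        if complete.startswith(marker):
            yield tail_offset
        pos = complete.find(b"\n" + marker)
        while pos != -1:
            yield tail_offset + pos + 1  # +1 skips the newline itself
            pos = complete.find(b"\n" + marker, pos + 1)
        tail = data[cut:]                # unfinished line goes to the next round
        tail_offset += cut
    if tail.startswith(marker):          # file did not end with a newline
        yield tail_offset

# Hypothetical two-message mbox, read in deliberately tiny blocks.
sample = (
    b"From a@example.com Mon Feb  2 21:03:34 2004\n"
    b"Subject: one\n\n>From here on this is body text\n"
    b"From b@example.com Mon Feb  2 21:04:00 2004\n"
    b"Subject: two\n\nbody\n"
)
offsets = list(find_from_offsets(io.BytesIO(sample), block_size=7))
```

Because only complete lines are searched, a separator can never be
split across two searches. The price is the extra copy in tail + block,
which gets expensive if a single line is much longer than the block
size.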

-- 
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke
|cookedm(at)physics(dot)mcmaster(dot)ca


