Finding messages in huge mboxes

Donn Cave donn at drizzle.com
Tue Feb 3 01:58:45 EST 2004


Quoth cookedm+news at physics.mcmaster.ca (David M. Cooke):
| At some point, Donn Cave <donn at u.washington.edu> wrote:
|> In article <401eb54c$0$315$e4fe514c at news.xs4all.nl>,
|>  Bastiaan Welmers <haasje at welmers.net> wrote:
|> ...
|>> I need to find messages in huge mbox files (50MB or more).
|> ...
|>> Especially because I often need messages at the end
|>> of the MBOX file.
|>> So I tried the following (scanning messages backwards
|>> on found "From " lines with readline())
|>
|> readline() is not your friend here.  I suggest that
|> you read large blocks of data, like 8192 bytes for
|> example, and search them iteratively.  Like,
|> next = block.find('\nFrom ', prev + 1)
|
| Unless, of course, you read '\nFr', then 'om ' in the next block...
|
| I can't think of a simple way around this (except for reading by
| lines). Concatenating the last two together means having to keep track of
| what you've seen in the last block. Maybe picking off the last line
| from the last block (using line.rfind('\n')), and concatenating that
| to the beginning of the next.

I'm reading backwards from the end, so the fragment is block[:start],
i.e. everything before the first '\nFrom ' found in the block.  Append
that to the block before it, and each block will always end at a
message boundary.  If you start in the middle, you have to deal with
an extra boundary problem.  Reading forward from the beginning would
be about as simple.

If I have overlooked some obvious problem with this, it wouldn't be
the first time, but I think it's as simple as it could be.  The only
inelegance is that you have to scan the fragment at least twice
(one extra time for each time it's added to a new block).
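
A minimal sketch of the backward scan described above, written for
Python 3 and assuming the mbox is opened in binary mode.  The function
name from_offsets_backwards, the SEP constant and the blocksize
parameter are illustrative choices, not part of the original posts,
and the sketch treats every '\nFrom ' as a message boundary without
checking for a preceding blank line.

```python
import io
import os

SEP = b"\nFrom "

def from_offsets_backwards(f, blocksize=8192):
    """Yield the byte offset of each 'From ' line, last message first.

    `f` must be a seekable binary file object (an assumption of this sketch).
    """
    f.seek(0, os.SEEK_END)
    pos = f.tell()      # file offset at which the current block starts
    fragment = b""      # block[:start] carried over from the later block
    while pos > 0:
        size = min(blocksize, pos)
        pos -= size
        f.seek(pos)
        # With the fragment appended, a '\nFrom ' split across two reads
        # is whole again, and the block ends at a boundary (or at EOF).
        block = f.read(size) + fragment
        end = len(block)
        while True:
            start = block.rfind(SEP, 0, end)
            if start == -1:
                break
            yield pos + start + 1   # +1 steps over the newline
            end = start
        # Everything before the earliest boundary found; the whole block
        # if none was found, which is where the rescanning cost comes from.
        fragment = block[:end]
    # The first message has no newline in front of its 'From ' line.
    if fragment.startswith(b"From "):
        yield 0

# Tiny in-memory example with a deliberately small block size.
sample = b"From a\nfirst\n\nFrom b\nsecond\n"
print(list(from_offsets_backwards(io.BytesIO(sample), blocksize=4)))  # [14, 0]
```

Since the offsets come out last message first, a caller who wants
something near the end of the file can stop iterating early and seek
straight to the offset it needs.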

	Donn Cave, donn at drizzle.com


