Finding messages in huge mboxes

Bastiaan Welmers haasje at welmers.net
Mon Feb 2 15:37:05 EST 2004


Hi,

I wonder if anyone has ever run into this same mbox issue.

I'm having the following problem:

I need to find messages in huge mbox files (50MB or more).
The following approach is (of course?) not very usable:

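import mailbox

fp = open("mbox", "r")
archive = mailbox.UnixMailbox(fp)
# skip the first message_number_needed messages one by one
i = 0
while i < message_number_needed:
    i += 1
    archive.next()

# the next message is the one I want
needed_message = archive.next()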

This is especially bad because I often need messages near the end
of the mbox file.
So I tried the following: scanning backwards from the end of the
file for "From " lines with readline():

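i = 0
j = 0
while 1:
    i += 1
    fp.seek(-i, 2)   # whence=2: offset is relative to the end of the file
    start = fp.tell()
    line = fp.readline()
    if not line:
        break
    if line[:5] == 'From ':
        j += 1
        if j == total_messages - message_number_needed:
            # point the mailbox at the start of this "From " line
            archive.seekp = start
            message = archive.next()
            # message found
            break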

But this also seems to be slow and CPU-intensive.
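One likely reason is that the loop above does a seek() and a readline() for every single byte. A minimal sketch of the same backwards scan done in larger chunks could look like this. It is written for current Python 3, and the helper name last_message_offsets and the chunk_size parameter are made up for illustration. It only looks for a newline followed by "From " and does not validate the rest of the line the way the mailbox module does.

```python
import os

def last_message_offsets(path, count, chunk_size=64 * 1024):
    """Return the byte offsets of the last `count` lines that start with
    b"From ", in file order. Hypothetical helper, not part of the stdlib."""
    marker = b"\nFrom "
    offsets = []
    with open(path, "rb") as fp:
        fp.seek(0, os.SEEK_END)
        pos = fp.tell()
        tail = b""  # first bytes of the chunk read in the previous step
        while pos > 0 and len(offsets) < count:
            start = max(0, pos - chunk_size)
            fp.seek(start)
            # Append a few bytes of the later chunk so that a marker
            # straddling the chunk boundary is still found.
            block = fp.read(pos - start) + tail
            idx = block.rfind(marker)
            while idx != -1 and len(offsets) < count:
                offsets.append(start + idx + 1)  # +1 skips the newline
                idx = block.rfind(marker, 0, idx)
            # The first line of the file has no newline in front of it.
            if start == 0 and len(offsets) < count and block.startswith(b"From "):
                offsets.append(0)
            tail = block[:len(marker) - 1]
            pos = start
    offsets.reverse()
    return offsets
```

With such an offset in hand, one can seek() straight to it and hand the data from there to the mail parser, instead of stepping through every earlier message.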

Does anyone have a better idea?

Regards,

Bastiaan Welmers


