Finding messages in huge mboxes

Erno Kuusela erno-news at erno.iki.fi
Tue Feb 3 15:21:33 EST 2004


Bastiaan Welmers <haasje at welmers.net> writes:

> 
> Especially because I often need messages at the end
> of the MBOX file.
> So I tried the following (scanning messages backwards
> on found "From " lines with readline())
> 
> i=0
> j=0
> while 1:
>   i+=1
>   fp.seek(-i, SEEK_TO_END=2)
>   line = fp.readline()
>   if not line:
>      break
>   if line[:5] == 'From ':
>      j+=1
>      if j == total_messages - message_number_needed:
>         archive.seekp = fp.tell()
>         message = archive.next()
>         # message found
> 
> But also seems to be slow and CPU consuming.

something like this might work. the loop below scanned a 115MB mailbox
in about 1 second on a 1.2GHz K7. it extracts the next-to-last message,
but you get the idea. if you don't want to pull the whole file into the
page cache, you could adapt it to start with a smaller mmapped chunk from
the end of the file and enlarge it until you find what you want.


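import os, re, mmap, sys
from cStringIO import StringIO
import email

# map the whole mailbox read-only
fd = os.open(sys.argv[1], os.O_RDONLY)
size = os.fstat(fd).st_size
print size
buf = mmap.mmap(fd, size, access=mmap.ACCESS_READ)

# each match starts at the blank line that precedes a "From " separator
# (a message at the very start of the file is not matched)
message_offsets = []
for m in re.finditer(r'\n\nFrom ', buf):
    message_offsets.append(m.start())

# + 2 skips the two newlines, so the slice begins at the "From " line
msgfp = StringIO(buf[message_offsets[-2] + 2:message_offsets[-1] + 2])
msg = email.message_from_file(msgfp)
print msg['to']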
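
the "start with a smaller chunk and enlarge it" variant could look
roughly like the sketch below. it is written for Python 3, and the
function name tail_message, its parameters and the 1MB starting window
are made up for illustration, not taken from any library. it assumes a
well-formed mbox where every separator is a blank line followed by
"From ", and it is only a sketch, not a tuned implementation.

```python
import email
import mmap
import os
import re

# mbox separator: a blank line followed by a "From " envelope line
SEPARATOR = re.compile(rb"\n\nFrom ")


def tail_message(path, n_from_end=2, initial_chunk=1 << 20):
    """Return the n-th message from the end (1 = last message).

    Hypothetical helper: maps only the tail of the file and doubles the
    mapped window until it contains enough "From " separators.
    """
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("empty mailbox")
    gran = mmap.ALLOCATIONGRANULARITY
    chunk = initial_chunk
    with open(path, "rb") as f:
        while True:
            # mmap offsets must be a multiple of ALLOCATIONGRANULARITY,
            # so round the window start down to the nearest boundary
            start = max(0, size - chunk)
            start -= start % gran
            with mmap.mmap(f.fileno(), size - start,
                           access=mmap.ACCESS_READ, offset=start) as buf:
                # offsets (relative to the window) of each "From " line
                offsets = [m.start() + 2 for m in SEPARATOR.finditer(buf)]
                # the first message in the file has no blank line before it
                if start == 0 and buf[:5] == b"From ":
                    offsets.insert(0, 0)
                if len(offsets) >= n_from_end:
                    begin = offsets[-n_from_end]
                    if n_from_end > 1:
                        end = offsets[-n_from_end + 1]
                    else:
                        end = len(buf)
                    return email.message_from_bytes(buf[begin:end])
            if start == 0:
                raise IndexError("mailbox has fewer messages than requested")
            chunk *= 2
```

tail_message(path, 2)['to'] then gives the same header as the script
above. doubling the window keeps the total bytes scanned within a small
constant factor of the final window size, so messages near the end of
the file stay cheap to find.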

  -- erno


