Finding messages in huge mboxes
Erno Kuusela
erno-news at erno.iki.fi
Tue Feb 3 15:21:33 EST 2004
Bastiaan Welmers <haasje at welmers.net> writes:
>
> Especially because I often need messages at the end
> of the MBOX file.
> So I tried the following (scanning messages backwards
> on found "From " lines with readline())
>
> i=0
> j=0
> while 1:
>     i+=1
>     fp.seek(-i, SEEK_TO_END=2)
>     line = fp.readline()
>     if not line:
>         break
>     if line[:5] == 'From ':
>         j+=1
>         if j == total_messages - message_number_needed:
>             archive.seekp = fp.tell()
>             message = archive.next()
>             # message found
>
> But also seems to be slow and CPU consuming.
something like this might work. the loop below scanned a 115 MB mailbox
in about 1 second on a 1.2 GHz K7. it extracts the next-to-last message,
but you get the idea. if you don't want to pull the whole file into the
cache, you could adapt it to start with a smaller mmapped chunk from the
end of the file and enlarge it until you find what you want.
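import os, re, mmap, sys
from cStringIO import StringIO
import email

fd = os.open(sys.argv[1], os.O_RDONLY)
size = os.fstat(fd).st_size
print size
# map the whole file read-only; pages are faulted in as the regex scans
buf = mmap.mmap(fd, size, access=mmap.ACCESS_READ)

# a message starts at a "From " line preceded by a blank line
# (this misses the very first message, which has no blank line before it)
message_offsets = []
for m in re.finditer(r'\n\nFrom ', buf):
    message_offsets.append(m.start())

# +2 skips the two newlines, so the slice runs from one "From " line to the next
msgfp = StringIO(buf[message_offsets[-2] + 2:message_offsets[-1] + 2])
msg = email.message_from_file(msgfp)
print msg['to']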
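
the "smaller chunk from the end" idea could look roughly like the sketch
below. it is written for Python 3 rather than the Python 2 used above, and
the names tail_message_offsets, message_from_end and the chunk parameter are
made up for illustration. it assumes, like the loop above, that every
"From " preceded by a blank line starts a message, so an unescaped "From "
line inside a body would be miscounted.

```python
import email
import mmap
import os
import re

# A message starts with "From " right after a blank line (or at offset 0).
FROM_LINE = re.compile(rb'\n\nFrom ')

def tail_message_offsets(path, wanted=2, chunk=1 << 20):
    """Return (offsets, size), mapping only as much of the file's tail as needed.

    offsets holds the start of each message found in the mapped tail, in
    file order; it contains at least `wanted` entries if the mailbox has
    that many messages.
    """
    size = os.path.getsize(path)
    if size == 0:
        return [], 0
    gran = mmap.ALLOCATIONGRANULARITY
    length = min(chunk, size)
    with open(path, 'rb') as f:
        while True:
            # mmap offsets must be a multiple of ALLOCATIONGRANULARITY,
            # so round the start of the window down.
            start = (size - length) // gran * gran
            with mmap.mmap(f.fileno(), size - start,
                           access=mmap.ACCESS_READ, offset=start) as buf:
                # +2 skips the two newlines in front of "From "
                offsets = [start + m.start() + 2
                           for m in FROM_LINE.finditer(buf)]
                # the first message has no blank line before it
                if start == 0 and buf[:5] == b'From ':
                    offsets.insert(0, 0)
            if len(offsets) >= wanted or start == 0:
                return offsets, size
            length = min(length * 2, size)  # enlarge the window and retry

def message_from_end(path, n=2, chunk=1 << 20):
    """Parse the n-th message counted from the end (n=1 is the last one)."""
    offsets, size = tail_message_offsets(path, wanted=n, chunk=chunk)
    if len(offsets) < n:
        raise IndexError('mailbox has fewer than %d messages' % n)
    begin = offsets[-n]
    end = offsets[-n + 1] if n > 1 else size
    with open(path, 'rb') as f:
        f.seek(begin)
        return email.message_from_bytes(f.read(end - begin))
```

message_from_end(path, 2)['to'] would then give the same header as the loop
above. a boundary that straddles the start of the window is simply not seen
in that pass; that is harmless, because it can only be older than every
message the window did find.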
-- erno