Efficient scanning of mbox files

Martin Franklin mfranklin1 at gatwick.westerngeco.slb.com
Mon Nov 11 07:25:44 EST 2002


On Mon, 2002-11-11 at 12:06, Martin Franklin wrote:
> On Mon, 2002-11-11 at 11:42, Moore, Paul wrote:
> > 
> >     def add_group(self, id, file):
> >         print "Opening file", file, "for group", id
> >         fp = open(file, "rb")
> >         posns = []
> >         oldpos = 0
> >         n = 0
> >         while 1:
> >             line = fp.readline()
> >             if not line: break
> >             if FROM_RE.match(line):
> >                 n += 1
> >                 posns.append(oldpos)
> >             oldpos = fp.tell()
> >         fp.close()
> >         posns.append(oldpos)
> >         print "Group", id, "- articles(posns) =", n, len(posns)
> >         self.groups[id] = (file, n, posns)
> > 
> 
> Paul,
> 
> 
> I ran the above example on my Python folder (7000+ messages...)
> it took 12 seconds to process.  Then I changed the 
> if FROM_RE.match(line):
> 
> to
> 
> if line.startswith("From "):
> 
> 
> And got a 2 second speed up....  
> 
> Then I slurped the file into a cStringIO.StringIO object and got it down
> to 5 seconds.....
> 
> 
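
For anyone wanting to reproduce the regex versus startswith comparison quoted above, here is a rough sketch in Python 3 (the code in this thread is Python 2). FROM_RE was never shown in the thread, so the pattern below is an assumption, and the sample lines and function names are made up for illustration. No timings are quoted here because they depend on the machine and the mailbox.

```python
import re
import timeit

# Assumed pattern: the real FROM_RE was not shown in the thread.
FROM_RE = re.compile(rb"^From ")

# Hypothetical sample data: one separator line followed by nine body lines.
LINES = [b"From someone@example.com Mon Nov 11 07:25:44 2002\n"] + [
    b"Some body text that is not a separator\n"
] * 9


def count_with_regex(lines):
    """Count separator lines using the compiled regular expression."""
    return sum(1 for line in lines if FROM_RE.match(line))


def count_with_startswith(lines):
    """Count separator lines using a plain prefix test."""
    return sum(1 for line in lines if line.startswith(b"From "))


if __name__ == "__main__":
    # Time both variants on the same data; absolute numbers will vary.
    for func in (count_with_regex, count_with_startswith):
        seconds = timeit.timeit(lambda: func(LINES), number=20000)
        print(func.__name__, round(seconds, 3), "seconds")
```

Both functions should agree on the count; only the time taken differs.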


Another thought: if you have Python 2.2 or later, you can iterate over
the file object directly:

    for line in fp:
        if line.startswith("From "):
            posns.append(oldpos)
        oldpos = fp.tell()


Again, this should shave a second or two off the result.
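
A caveat on mixing iteration with tell(): as far as I know, iterating over a real file object in Python 2 reads ahead in chunks, so tell() may not point at the line just returned (an in-memory StringIO object does not have this problem). Here is a minimal sketch in Python 3 that sidesteps the question by adding up line lengths instead. It assumes the file is opened in binary mode so that len(line) is a byte count; scan_offsets and the sample mailbox are names made up for this example.

```python
import io


def scan_offsets(fp):
    """Return the byte offset of every line starting with b"From ",
    followed by one final entry for the end of the data.

    fp must be opened in binary mode so that len(line) counts bytes.
    """
    posns = []
    pos = 0  # offset of the line about to be examined
    for line in fp:
        if line.startswith(b"From "):
            posns.append(pos)
        pos += len(line)
    posns.append(pos)  # sentinel: end of the last message
    return posns


# Hypothetical two-message mailbox held in memory for illustration.
sample = io.BytesIO(
    b"From a@example.com Mon Nov 11 07:25:44 2002\n"
    b"Subject: one\n"
    b"\n"
    b"first body\n"
    b"From b@example.com Mon Nov 11 07:30:00 2002\n"
    b"Subject: two\n"
    b"\n"
    b"second body\n"
)
posns = scan_offsets(sample)
print(len(posns) - 1, "messages")
```

The same function should work on a file opened with open(path, "rb").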


This is my fastest version:


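import time
import cStringIO

groups = {}

def add_group(id, file, fp):
    # Record the offset of every line that starts a new message.
    print "Opening file", file, "for group", id
    posns = []
    oldpos = 0
    for line in fp:
        if line.startswith("From "):
            posns.append(oldpos)
        # tell() tracks each line when fp is an in-memory StringIO object
        oldpos = fp.tell()
    # The final entry marks the end of the last message.
    posns.append(oldpos)
    n = len(posns) - 1
    print "Group", id, "- articles(posns) =", n, len(posns)
    groups[id] = (file, n, posns)


# Slurp the whole mailbox into memory; "rb" keeps the offsets byte-accurate.
cfile = cStringIO.StringIO(open("Mail/Python", "rb").read())
add_group(1, "/home/bpse/Mail/Python", cfile)
cfile.close()
# Report the elapsed time as measured by time.clock()
print time.clock()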
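
Once the positions are recorded, pulling out a single article is just a seek and a read. Here is a sketch in Python 3, assuming posns has the layout built by add_group (one start offset per message, followed by one end offset). read_message and the sample mailbox are made up for illustration.

```python
import io


def read_message(fp, posns, index):
    """Return the raw bytes of message number index (0-based)."""
    start, end = posns[index], posns[index + 1]
    fp.seek(start)
    return fp.read(end - start)


# Hypothetical mailbox and its offsets, written out by hand for illustration.
first = b"From a@example.com Mon Nov 11 07:25:44 2002\n\nfirst body\n"
second = b"From b@example.com Mon Nov 11 07:30:00 2002\n\nsecond body\n"
mailbox = io.BytesIO(first + second)
posns = [0, len(first), len(first) + len(second)]

print(read_message(mailbox, posns, 1).decode("ascii"))
```

The extra end offset is what lets the last message be read the same way as the others.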













