Efficient scanning of mbox files

Martin Franklin mfranklin1 at gatwick.westerngeco.slb.com
Mon Nov 11 07:06:27 EST 2002


On Mon, 2002-11-11 at 11:42, Moore, Paul wrote:
> 
>     def add_group(self, id, file):
>         print "Opening file", file, "for group", id
>         fp = open(file, "rb")
>         posns = []
>         oldpos = 0
>         n = 0
>         while 1:
>             line = fp.readline()
>             if not line: break
>             if FROM_RE.match(line):
>                 n += 1
>                 posns.append(oldpos)
>             oldpos = fp.tell()
>         fp.close()
>         posns.append(oldpos)
>         print "Group", id, "- articles(posns) =", n, len(posns)
>         self.groups[id] = (file, n, posns)
> 

Paul,


I ran the above example on my Python folder (7000+ messages);
it took 12 seconds to process.  Then I changed

    if FROM_RE.match(line):

to

    if line.startswith("From "):

and it ran about 2 seconds faster.
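
For reference, the startswith variant of the scan loop might look
something like the sketch below.  It assumes Python 3 (the quoted code
is Python 2), so lines read from a file opened in "rb" mode are bytes
and the prefix becomes b"From ".  The name scan_offsets and the sample
mailbox are made up for illustration.

```python
import io


def scan_offsets(fp):
    """Scan a binary file-like object for mbox "From " lines.

    Returns (n, posns): n is the number of "From " lines found, and
    posns holds the start offset of each one plus the final end
    offset, mirroring the posns list in the quoted add_group().
    """
    posns = []
    oldpos = 0
    while True:
        line = fp.readline()
        if not line:
            break
        # Plain prefix test in place of FROM_RE.match(line)
        if line.startswith(b"From "):
            posns.append(oldpos)
        oldpos = fp.tell()
    n = len(posns)
    posns.append(oldpos)  # end-of-file offset, as in the original
    return n, posns


# Hypothetical two-message mailbox, held in memory for the demo
sample = (
    b"From a@example.com Mon Nov 11 07:06:27 2002\n"
    b"Subject: one\n\nbody one\n"
    b"From b@example.com Mon Nov 11 07:10:00 2002\n"
    b"Subject: two\n\nbody two\n"
)
n, posns = scan_offsets(io.BytesIO(sample))
print(n, posns)
```

FROM_RE isn't shown in the thread, so a plain prefix test may not
accept exactly the same lines as the regular expression does.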

Then I slurped the whole file into a cStringIO.StringIO object and
read the lines from that instead of from disk, which got it down to
5 seconds.
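
A rough sketch of the slurping approach, assuming Python 3, where
io.BytesIO stands in for Python 2's cStringIO.StringIO.  The name
scan_in_memory and the temp-file demo are hypothetical.  Tracking
offsets by summing line lengths instead of calling tell() is a
variation of mine, not what was timed above.

```python
import io
import os
import tempfile


def scan_in_memory(path):
    """Read the whole mailbox into memory, then scan for "From " lines.

    Returns (n, posns) in the same shape as the quoted add_group():
    the start offset of each "From " line plus the final end offset.
    """
    with open(path, "rb") as fp:
        buf = io.BytesIO(fp.read())  # one big read, then no more disk I/O
    posns = []
    pos = 0
    for line in buf:
        if line.startswith(b"From "):
            posns.append(pos)
        pos += len(line)  # byte offset of the next line
    n = len(posns)
    posns.append(pos)  # end-of-file offset
    return n, posns


# Hypothetical two-message mailbox written to a temp file for the demo
data = (
    b"From a@example.com Mon Nov 11 07:06:27 2002\n"
    b"Subject: one\n\nbody one\n"
    b"From b@example.com Mon Nov 11 07:10:00 2002\n"
    b"Subject: two\n\nbody two\n"
)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
try:
    n, posns = scan_in_memory(tmp.name)
finally:
    os.remove(tmp.name)
print(n, posns)
```

The catch is that the whole mailbox has to fit in memory, so this
trades RAM for fewer reads from disk.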


Just some thoughts
Martin







