looking for speed-up ideas

Andrew Dalke adalke at mindspring.com
Thu Feb 6 05:18:42 EST 2003


Ram Bhamidipaty wrote:
> It would be impressive if there were a _pure_ python script that could
> deliver the performance of the grep + sort + tail command pipe line.

Ahh, well, if you want pure performance and just want the filenames,
then there are other ways that might make it faster:

   - try a memory mapped file instead of reading.  (Some OSes
won't like a mmap'ed file >> main memory size)

   - this could be combined with a regexp, of the form
       r"\nF/(\d+)/([^\n]+)"
      combined with a findall

      (yeah, it loads all the filenames into memory.  But I'm
        trying to get performance here, and you've only a few
        million names --> less than a couple hundred MBs.)
      (sort, btw, can use intermediate files as extra memory)
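The mmap-plus-findall idea above can be sketched in modern Python. The filename and the "F/<size>/<name>" line format are assumptions taken from the regexp in this thread; note that anchoring with (?m)^ instead of "\n" also catches the file's first line:

```python
import mmap
import re

# (?m)^ matches at the start of every line, so the first line is
# caught too (the "\nF/..." form would skip it).
pat = re.compile(rb"(?m)^F/(\d+)/(.+)$")

def largest_files(path, n=200):
    """Return the n largest (size, name) pairs, biggest first."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # findall materializes every (size, name) pair at once,
            # as noted above -- fine for a few million names.
            pairs = pat.findall(mm)
        finally:
            mm.close()
    pairs = [(int(size), name.decode()) for size, name in pairs]
    pairs.sort(reverse=True)
    return pairs[:n]
```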

To try the regexp approach even if mmap doesn't work, you
could also do

   pat = re.compile(r"\nF/(\d+)/([^\n]+)")
   data = []
   min_size = 0   # negated size threshold; tightened as the list fills
   while 1:
     # Read a big chunk, then finish off the partial line at its end.
     # The leading "\n" lets the pattern match at the chunk's first line.
     s = "\n" + infile.read(25*1024*1024) + infile.readline()
     if s == "\n":
       break
     for size, name in pat.findall(s):
       # Negate so an ascending sort puts the largest files first.
       try:
         size = -int(size)
       except ValueError:
         size = -long(size)
       if size < min_size:
         data.append( (size, name) )
         if len(data) > 1000:
           data.sort()
           del data[200:]
           # New candidates must beat the smallest size still kept.
           min_size = data[-1][0]

(Untested!)
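The periodic sort-and-truncate above is a hand-rolled top-N selection. With today's standard library the same thing can be written with heapq.nlargest, which keeps only N pairs in memory during the scan (a sketch, still assuming the "F/<size>/<name>" line format from this thread):

```python
import heapq
import re

# Assumed line format from this thread: "F/<size>/<name>".
pat = re.compile(r"F/(\d+)/(.+)")

def top_n(lines, n=200):
    """Return the n largest (size, name) pairs in descending order."""
    matches = (pat.match(line) for line in lines)
    pairs = ((int(m.group(1)), m.group(2)) for m in matches if m)
    # nlargest holds at most n entries while consuming the generator.
    return heapq.nlargest(n, pairs)
```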

The idea is that re parsing might be faster than a split and
tuple assignment.
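Whether findall really beats a split-based loop can be checked with a small timeit harness (synthetic data; the numbers will vary by machine and Python version, so this measures rather than asserts a winner):

```python
import re
import timeit

# Synthetic input in the assumed "F/<size>/<name>" format.
lines = "\n".join("F/%d/name%d" % (i, i) for i in range(10000)) + "\n"
pat = re.compile(r"\nF/(\d+)/([^\n]+)")

def with_regexp():
    # Prepend "\n" so the first line matches the \n-anchored pattern.
    return pat.findall("\n" + lines)

def with_split():
    out = []
    for line in lines.splitlines():
        kind, size, name = line.split("/", 2)
        if kind == "F":
            out.append((size, name))
    return out

if __name__ == "__main__":
    for fn in (with_regexp, with_split):
        print("%s: %.3fs" % (fn.__name__, timeit.timeit(fn, number=20)))
```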


					Andrew
					dalke at dalkescientific.com
