iglob performance no better than glob

Sun Jan 31 19:23:05 EST 2010

On Jan 31, 2:44 pm, Peter Otten <__pete... at web.de> wrote:
> Kyp wrote:
> > I have a dir with a large # of files that I need to perform operations
> > on, but I only need to access a subset of the files, i.e. the first
> > 100 files.
>
> > Using glob is very slow, so I ran across iglob, which returns an
> > iterator, which seemed just like what I wanted. I could iterate over
> > the files that I wanted, not having to read the entire dir.
>
> > The iglob call itself returned right away, but accessing the first
> > file took about the same time as glob.glob.
>
> > Here's some code to compare glob vs. iglob performance. It prints
> > the time before/after a glob.iglob('*.*') files.next() sequence and a
> > glob.glob('*.*') sequence.
>
> > #!/usr/bin/env python
>
> > import glob,time
> > print '\nTest of glob.iglob'
> > print 'before       iglob:', time.asctime()
> > files = glob.iglob('*.*')
> > print 'after        iglob:',time.asctime()
> > print files.next()
> > print 'after files.next():', time.asctime()
>
> > print '\nTest of glob.glob'
> > print 'before        glob:', time.asctime()
> > files = glob.glob('*.*')
> > print 'after         glob:',time.asctime()
>
> > Here are the results:
>
> > Test of glob.iglob
> > before       iglob: Sun Jan 31 11:09:08 2010
> > after        iglob: Sun Jan 31 11:09:08 2010
> > foo.bar
> > after files.next(): Sun Jan 31 11:09:59 2010
>
> > Test of glob.glob
> > before        glob: Sun Jan 31 11:09:59 2010
> > after         glob: Sun Jan 31 11:10:51 2010
>
> > The results are about the same for the 2 approaches, both took about
> > 51 seconds. Am I doing something wrong with iglob?
>
> No, but iglob() being lazy is pointless in your case because it uses
> os.listdir() and fnmatch.filter() underneath, and both process the whole
> directory listing before returning anything.
>
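
For readers on a newer Python, the laziness described above can be
sketched without ctypes. This is only an illustration and not part of the
original exchange: it assumes Python 3.5 or later, where os.scandir()
yields entries as the directory is read. The names iter_matches and
first_matches are hypothetical helpers. Unlike glob, fnmatch does not skip
dotfiles, and the "first" names are simply whatever order the filesystem
returns.

```python
import fnmatch
import os
from itertools import islice


def iter_matches(directory, pattern):
    """Yield names in `directory` that match `pattern`, one at a time."""
    # os.scandir() is an iterator, so entries are read on demand.
    with os.scandir(directory) as entries:
        for entry in entries:
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry.name


def first_matches(directory, pattern, limit):
    """Return at most `limit` matching names without listing the rest."""
    matches = iter_matches(directory, pattern)
    try:
        return list(islice(matches, limit))
    finally:
        # Close the generator so the directory handle is released promptly.
        matches.close()
```

Something like first_matches('.', '*.*', 100) would then stop reading the
directory once 100 matching names have been seen.
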
> > Is there a way to get the first X # of files from a dir with lots of
> > files, that does not take a long time to run?
>
> Here's my attempt. It turned out to be more work than expected, so I cut a
> few corners. It's Linux-only "works on my machine" code, but may give you
> some hints on how to proceed.
>
> from ctypes import *
> import fnmatch
> import glob
> import os
> import re
> from itertools import ifilter, imap
>
> class dirent(Structure):
>     "works on my machine ;)"
>     _fields_ = [
>         ("d_ino", c_long),
>         ("d_off", c_long),
>         ("d_reclen", c_ushort),
>         ("d_type", c_ubyte),
>         ("d_name", c_char*256)]
>
> direntp = POINTER(dirent)
>
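> LIBC = "libc.so.6"
> libc = CDLL(LIBC)
> # declare pointer types; the default int restype can truncate DIR* on 64-bit
> libc.opendir.restype = c_void_p
> libc.readdir.argtypes = [c_void_p]
> libc.readdir.restype = direntp
> libc.closedir.argtypes = [c_void_p]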
>
> def diriter(dir):
>     "lazy partial replacement for os.listdir()"
>     # errors? what errors?
>     dirp = libc.opendir(dir)
>     if not dirp:
>         return
>     try:
>         while True:
>             ep = libc.readdir(dirp)
>             if not ep:
>                 break
>             yield ep.contents.d_name
>     finally:
>         libc.closedir(dirp)
>
> def filter(names, pattern):
>     "lazy partial replacement for fnmatch.filter()"
>     import posixpath
>
>     pattern = os.path.normcase(pattern)
>     r = fnmatch.translate(pattern)
>     r = re.compile(r)
>
>     if os.path is not posixpath:
>         names = imap(os.path.normcase, names)
>
>     return ifilter(r.match, names)
>
> def globiter(path):
>     "lazy partial replacement for glob.glob()"
>     dir, filename = os.path.split(path)
>     if glob.has_magic(dir):
>         raise ValueError("wildcards in directory not supported")
>     return filter(diriter(dir), filename)
>
> if __name__ == "__main__":
>     import sys
>     [pattern] = sys.argv[1:]
>     for name in globiter(pattern):
>         print name
>
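
The diriter() above returns silently when opendir() fails ("errors? what
errors?"). One possible way to surface the failure is to load libc with
use_errno=True and turn errno into an OSError. This is a rough sketch in
Python 3, not Peter's code: open_dir_checked is a hypothetical helper, and
it assumes a POSIX system where CDLL(None) exposes the C library's symbols.

```python
import ctypes
import errno
import os

# Assumption: POSIX; CDLL(None) gives access to opendir/closedir from libc.
libc = ctypes.CDLL(None, use_errno=True)
libc.opendir.argtypes = [ctypes.c_char_p]
libc.opendir.restype = ctypes.c_void_p   # DIR* is pointer-sized
libc.closedir.argtypes = [ctypes.c_void_p]
libc.closedir.restype = ctypes.c_int


def open_dir_checked(path):
    """Hypothetical helper: return a DIR* handle or raise OSError."""
    dirp = libc.opendir(os.fsencode(path))
    if not dirp:
        # ctypes keeps a private copy of errno when use_errno=True.
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    return dirp
```

The caller is still responsible for passing the handle to libc.closedir()
when done, as the finally block in diriter() does.
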
> Peter

I'll give it a try, thanks for the reply.
mark