iglob performance no better than glob

Sun Jan 31 14:44:22 EST 2010

Kyp wrote:

> I have a dir with a large number of files that I need to perform
> operations on, but I only need to access a subset of the files, i.e.
> the first 100 files.
> 
> Using glob is very slow, so I ran across iglob, which returns an
> iterator, which seemed just like what I wanted. I could iterate over
> the files that I wanted, not having to read the entire dir.
> 
> So the iglob call itself was faster, but accessing the first file took
> about the same time as glob.glob.
> 
> Here's some code to compare glob vs. iglob performance. It outputs
> the time before/after a glob.iglob('*.*') files.next() sequence and a
> glob.glob('*.*') sequence.
> 
> #!/usr/bin/env python
> 
> import glob,time
> print '\nTest of glob.iglob'
> print 'before       iglob:', time.asctime()
> files = glob.iglob('*.*')
> print 'after        iglob:',time.asctime()
> print files.next()
> print 'after files.next():', time.asctime()
> 
> print '\nTest of glob.glob'
> print 'before        glob:', time.asctime()
> files = glob.glob('*.*')
> print 'after         glob:',time.asctime()
> 
> 
> Here are the results:
> 
> Test of glob.iglob
> before       iglob: Sun Jan 31 11:09:08 2010
> after        iglob: Sun Jan 31 11:09:08 2010
> foo.bar
> after files.next(): Sun Jan 31 11:09:59 2010
> 
> Test of glob.glob
> before        glob: Sun Jan 31 11:09:59 2010
> after         glob: Sun Jan 31 11:10:51 2010
> 
> The results are about the same for the 2 approaches, both took about
> 51 seconds. Am I doing something wrong with iglob?

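No, but iglob() being lazy is pointless in your case because it uses
os.listdir() and fnmatch.filter() underneath. os.listdir() reads the whole
directory into a list, and fnmatch.filter() then matches that entire list,
so all the work is done before iglob() yields its first name.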
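
To make that concrete, here is a minimal sketch (written for Python 3, not
the Python 2 of the rest of this post) of a generator that wraps an eager
listing. The names eager_listing and iglob_like are made up for the
illustration and are not the real glob internals; the point is only that
the first next() already pays for the full listing.

```python
import fnmatch

def eager_listing(names, log):
    """Hypothetical stand-in for os.listdir(): builds the full list first."""
    result = []
    for name in names:
        log.append(name)      # record every entry that gets read
        result.append(name)
    return result

def iglob_like(names, pattern, log):
    """Generator with the same shape as the iglob() described above."""
    for name in fnmatch.filter(eager_listing(names, log), pattern):
        yield name

log = []
it = iglob_like(["a.txt", "b.txt", "c.dat"], "*.txt", log)
entries_read_before = len(log)   # creating the generator reads nothing
first = next(it)                 # the first next() runs the whole listing
entries_read_after = len(log)
```

Creating the generator is instant, which matches the timings in the
question: the delay shows up at files.next(), not at glob.iglob().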

> Is there a way to get the first X files from a dir with lots of
> files that does not take a long time to run?

Here's my attempt. It turned out to be more work than expected, so I cut a 
few corners. It's Linux-only "works on my machine" code, but may give you 
some hints on how to proceed.

from ctypes import *
import fnmatch
import glob
import os
import re
from itertools import ifilter, imap

class dirent(Structure):
    "works on my machine ;)"
    _fields_ = [
        ("d_ino", c_long),
        ("d_off", c_long),
        ("d_reclen", c_ushort),
        ("d_type", c_ubyte),
        ("d_name", c_char*256)]

direntp = POINTER(dirent)

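LIBC = "libc.so.6"
libc = CDLL(LIBC)
# DIR* is a pointer; without these declarations ctypes treats it as a
# C int, which can truncate the address on 64-bit systems
libc.opendir.argtypes = [c_char_p]
libc.opendir.restype = c_void_p
libc.readdir.argtypes = [c_void_p]
libc.readdir.restype = direntp
libc.closedir.argtypes = [c_void_p]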

def diriter(dir):
    "lazy partial replacement for os.listdir()"
    # errors? what errors?
    dirp = libc.opendir(dir)
    if not dirp:
        return
    try:
        while True:
            ep = libc.readdir(dirp)
            if not ep:
                break
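            name = ep.contents.d_name
            # os.listdir() omits these two entries, so skip them here too
            if name not in (".", ".."):
                yield name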
    finally:
        libc.closedir(dirp)

def filter(names, pattern):
    "lazy partial replacement for fnmatch.filter()"
    import posixpath

    pattern = os.path.normcase(pattern)
    r = fnmatch.translate(pattern)
    r = re.compile(r)

    if os.path is not posixpath:
        names = imap(os.path.normcase, names)

    return ifilter(r.match, names)

def globiter(path):
    "lazy partial replacement for glob.glob()"
    dir, filename = os.path.split(path)
    if glob.has_magic(dir):
        raise ValueError("wildcards in directory not supported")
    # an empty dir part means the current directory; opendir("") would fail
    return filter(diriter(dir or os.curdir), filename)

if __name__ == "__main__":
    import sys
    [pattern] = sys.argv[1:]
    for name in globiter(pattern):
        print name
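
For comparison, a rough sketch of the same idea without ctypes, assuming
Python 3.6 or later, where the standard library has os.scandir(). That
function did not exist when the code above was written, and the helper
name first_matches is made up for this example. os.scandir() returns an
iterator over directory entries, so itertools.islice() can stop after the
first few matches instead of listing everything.

```python
import fnmatch
import os
import tempfile
from itertools import islice

def first_matches(directory, pattern, limit):
    """Return up to `limit` names in `directory` that match `pattern`.

    Entries come back in whatever order the OS reports them, so
    "first" means first encountered, not first alphabetically.
    """
    with os.scandir(directory) as entries:
        names = (entry.name for entry in entries)
        matching = (n for n in names if fnmatch.fnmatch(n, pattern))
        # islice() stops pulling entries once `limit` matches are found
        return list(islice(matching, limit))

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(20):
        open(os.path.join(tmp, "file%02d.txt" % i), "w").close()
    open(os.path.join(tmp, "README"), "w").close()
    sample = first_matches(tmp, "*.*", 5)
```

Like the ctypes version, this returns bare file names rather than full
paths, and it does not handle wildcards in the directory part.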

Peter