YATOPS (Yet Another Thread on Python's Speed) (was Re: HELP! Must choose language!)

David Bolen db3l at fitlinxx.com
Thu Jan 9 18:28:16 EST 2003


BranimirPetrovic at yahoo.com (Branimir Petrovic) writes:

> Andy, would you mind showing the snippet of the critical part of your
> "┘just loads up each file and writes it to dest." statement. Not quite
> see/understand what you mean (and how).

I expect from his comment that Andy literally reads the entire file
into memory (e.g., open()s it and just calls read()), and then writes
it out.  So a single iteration through Python code, but potentially a
lot of memory usage.
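
In other words, something like the sketch below - my assumption about
the approach, with placeholder filenames, not his actual code:

          - - - - - - - - - - - - - - - - - - - - - - - - -
    # Whole-file copy: one read() pulls the entire file into memory
    source = open("somefile.dat", "rb")    # placeholder filenames
    data = source.read()
    source.close()

    dest = open("copy.dat", "wb")
    dest.write(data)
    dest.close()
          - - - - - - - - - - - - - - - - - - - - - - - - -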

> Having fair amount of scripting experience of what's (not) achievable
> using M$ pathetic scripting approach (FileSystemsObject & cousins), 
> I had impression that beating robocopy.exe using (any) scripting 
> language is just not in cards.

Well, using the very high level COM scripting approach is going to
have reasonably high overhead to instantiate the various objects along
the way and then to actually issue the various calls amongst the
objects.
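
For reference, that approach from Python looks roughly like the
sketch below (placeholder filenames); every method call here goes
through the COM dispatch machinery, which is where much of that
overhead comes from:

          - - - - - - - - - - - - - - - - - - - - - - - - -
    # Copying via the Scripting.FileSystemObject COM object (sketch)
    import win32com.client

    fso = win32com.client.Dispatch("Scripting.FileSystemObject")
    fso.CopyFile("C:\\src\\somefile.dat", "D:\\dest\\somefile.dat")
          - - - - - - - - - - - - - - - - - - - - - - - - -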

Assuming that the actual processing or decision making you have to go
through for each file to be transferred isn't tremendous (e.g., you
just want to enumerate and transfer a tree), there are probably two
main potential performance bottlenecks in such a program - the
traversal of the directory structures and the control of the actual
I/O to perform the necessary operations to recreate the tree and
transfer the file.

The actual file transfer is, perhaps surprisingly, not likely to be
bottlenecked by Python's interpretation, as long as you use a large
enough block size for the physical I/O that the overhead of the
Python loop is small compared to the time spent performing the I/O.
For example, a very straightforward Python copy program:

          - - - - - - - - - - - - - - - - - - - - - - - - -
    # copy.py

    import sys
    import time

    # 64KB per read keeps the Python loop overhead small relative to the I/O
    BLOCKSIZE = 65536

    if __name__ == "__main__":

        source = open(sys.argv[1],'rb')
        dest   = open(sys.argv[2],'wb')

        size = 0
        start = time.time()
        while 1:
            block = source.read(BLOCKSIZE)
            if not block: break
            dest.write(block)
            size = size + len(block)
        end = time.time()

        print "%d bytes in %g s (%g KB)" % \
              (size,end-start,(size/(end-start))/1024)
          - - - - - - - - - - - - - - - - - - - - - - - - -

should be able to achieve a very respectable copy rate because the
lion's share of time will be spent in raw I/O and not that much in the
Python interpreter overhead.
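
(You'd run it as, say, "python copy.py bigfile.dat bigfile.copy" with
placeholder names there, and it reports the byte count, elapsed time
and effective KB/s rate when it finishes.)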

Walking a directory structure can be another matter, depending on the
environment involved, because there you may have a ton of individual
directory entries that each need processing by the scripting
language, so the interpretive overhead per operation is much higher.

With respect to Python, there is also the issue that walking the
filesystem tends to require not only enumerating the files in each
directory, but issuing a separate operation to obtain status
information about each of those files (os.stat), which hurts
efficiency under Windows.  But you can obtain much higher performance
if you are willing to use the win32all package and its native Win32
operations to walk the directory structure - in which case you
receive directory entry and status information in a single call.
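
For comparison, the portable way of getting at the same information
looks roughly like the sketch below (the visit_tree/visit names are
just for illustration; this isn't the os.path.walk-based code I
actually benchmarked) - note the separate os.stat call for every
entry:

          - - - - - - - - - - - - - - - - - - - - - - - - -
    # Portable walk: one listdir per directory plus one stat per entry
    import os
    import stat

    def visit_tree(root, visit):
        for name in os.listdir(root):
            path = os.path.join(root, name)
            st = os.stat(path)            # separate call for each entry
            if stat.S_ISDIR(st[stat.ST_MODE]):
                visit_tree(path, visit)
            else:
                visit(path, st[stat.ST_SIZE])
          - - - - - - - - - - - - - - - - - - - - - - - - -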

For example, the following routine scans a directory tree to summarize
the number and sizes of the files it contains.  Originally, I
implemented this with the more generic os.path.walk routine.  By
moving to the more direct Win32 calls, and also avoiding some of the
higher level path manipulation calls in favor of direct code, I was
able to obtain an over four-fold increase in performance.  It can
process a filesystem with about 140,000 files (spread over a large
number of directories) in about 4 minutes, or around 550-600 files/s
including directory traversal.  Generally speaking, in a transfer
process that's probably fast enough to keep your pipes busy :-)

There's still a reasonable amount of Python processing per entry in
the tree being handled, but I found this to be more than enough
performance to avoid having to move out of Python for the
application.

          - - - - - - - - - - - - - - - - - - - - - - - - -
import win32api

def map_disk_usage(root, depth):
    """map_disk_usage(root, depth)

    Recursive function to build up a map of disk usage by directory.

    The directory at 'root' is scanned, and a total count and space for
    files in the directory computed, as well as an estimate of physical
    usage assuming a 64K cluster size (FAT16 on 4GB partition).

    The result for the directory is stored as a tuple (count, size,
    physical) where size and physical are long integers. If there are
    subdirectories, then the function returns a dictionary with the
    top level information at key "None" and each subdirectory at a key
    of its name.  If there are no subdirectories, then this function
    returns the size tuple directly.

    The depth parameter limits how deep the scan will go - levels beyond
    that are rolled up into the prior level.  A value of 0 will return
    aggregate values at or beneath the root, 1 will go 1 level down and so
    on.  A value of -1 can be used for unlimited depth."""

    # For maximum performance (4x), native Win32 calls are used.  The
    # big win is not needing to call stat on each file to get
    # type/size information.  A loss is that the win32api module
    # creates an object per file in the directory, so the peak working
    # set of this implementation for large directories is higher than
    # a listdir/stat implementation.

    curdir  = {}
    dirsize = physize = 0L
    count   = 0

    if root[-1] in ('/','\\'):
        root = root[:-1]
    try:
        names = win32api.FindFiles(root+'\\*')
    except win32api.error:
        # Couldn't read the directory (permissions, etc.) - treat as empty
        names = []

    if depth > 0:
        recurse_depth = depth - 1
    else:
        recurse_depth = depth

    for name in names:
        # Performance is critical here, so handle join and directory check
        # manually (managed to shave about 50% off by doing this)
        fname = name[8]            # WIN32_FIND_DATA tuple: [8] is the filename
        fsize = name[5]            # [5] is the low 32 bits of the file size

        if (name[0] & 0x10):       # [0] is the attribute mask; 0x10 = directory
            if fname not in ('.','..'):
                result = map_disk_usage(root + '\\' + fname, recurse_depth)
                if depth == 0:
                    count   = count   + result[0]
                    dirsize = dirsize + result[1]
                    physize = physize + result[2]
                else:
                    curdir[fname] = result
        else:
            count   = count + 1
            dirsize = dirsize + fsize
            # Round up to the next 64K cluster for the physical usage estimate
            physize = physize + ((fsize + 65535) / 65536) * 65536

    if not curdir:
        curdir = (count,dirsize,physize)
    else:
        curdir[None] = (count,dirsize,physize)
    return curdir
          - - - - - - - - - - - - - - - - - - - - - - - - -
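
As a quick illustration of how the result might be consumed (the path
below is just a placeholder):

          - - - - - - - - - - - - - - - - - - - - - - - - -
    # Summarize a tree one level deep using the routine above
    usage = map_disk_usage("D:\\Data", 1)

    if isinstance(usage, dict):
        # The None key holds totals for files directly in the root;
        # each immediate subdirectory's rolled-up totals sit under
        # its own name.
        for name, (count, size, physical) in usage.items():
            print name, count, size, physical
    else:
        # No subdirectories at all - just (count, size, physical)
        print usage
          - - - - - - - - - - - - - - - - - - - - - - - - -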

Of course whether or not an implementation will meet your specific
performance goals depends on what those specific goals are.  And if
the task you are trying to perform is already a good match for a
pre-existing tool (such as robocopy), I'd certainly go ahead and use
the tool - a true compiled executable will generally have an edge over
a script.  But if you're implementing any custom functionality in the
process anyway, don't assume that the directory walk/copy portion
isn't suitable for scripting.

--
-- David
-- 
/-----------------------------------------------------------------------\
 \               David Bolen            \   E-mail: db3l at fitlinxx.com  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
\-----------------------------------------------------------------------/



