Unexpected timing results with file I/O

Carl Banks pavlovevidence at gmail.com
Mon Feb 4 13:12:14 EST 2008


On Feb 4, 12:53 pm, rdahlstrom <roger.dahlst... at gmail.com> wrote:
> On Feb 4, 10:17 am, Steven D'Aprano <st... at REMOVE-THIS-cybersource.com.au> wrote:
> > After reading an earlier thread about opening and closing lots of files,
> > I thought I'd do a little experiment.
>
> > Suppose you have a whole lot of files, and you need to open each one,
> > append a string, then close it again. There are two obvious ways to do
> > it: group your code by file, or group your code by procedure.
>
> > # Method one: grouped by file.
> > for each file:
> >     open the file, append the string, then close it
>
> > # Method two: grouped by procedure.
> > for each file:
> >     open the file
> > for each open file:
> >     append the string
> > for each open file:
> >     close the file
>
> > If you have N files, both methods make the same number of I/O calls: N
> > opens, N writes, N closes. Which is faster?
>
> > Intuitively, the first method has *got* to be faster, right? It's got one
> > loop instead of three and it doesn't build an intermediate list of open
> > file objects. It's so *obviously* going to be faster that it is hardly
> > worth bothering to check it with timeit, right?
>
> > Well, I wouldn't be writing unless that intuitive result was wrong. So
> > here's my test results:
>
> > Method 1:
>
> > >>> import timeit
> > >>> names = ['afile' + str(n) for n in range(1000)]
> > >>> T = timeit.Timer('''for name in names:
> > ...     fp = open(name, 'a'); fp.write('xyz\\n'); fp.close()
> > ... ''', 'from __main__ import names')
> > >>> min(T.repeat(6, 500))
> > 17.391216039657593
>
> > Method 2:
>
> > >>> for name in names:  # reset the files to an empty state.
> > ...     fp = open(name, 'w'); fp.close()
> > ...
> > >>> T = timeit.Timer('''files = [open(name, 'a') for name in names]
> > ... for fp in files:
> > ...     fp.write('xyz\\n')
> > ... for fp in files:
> > ...     fp.close()
> > ... ''', '''from __main__ import names''')
> > >>> min(T.repeat(6, 500))
> > 16.823362112045288
>
> > Surprisingly, Method 2 is a smidgen faster, by about half a second over
> > 500,000 open-write-close cycles. It's not much faster, but the result is
> > consistent over many tests, even when many of the parameters are changed
> > (e.g. the number of files, the number of runs per timeit test, etc.).
>
> > I'm using Linux and Python 2.5.
>
> > So, what's going on? Can anyone explain why the code which does more work
> > takes less time?
>
> > --
> > Steven
>
> The code that does more work takes more time.  The second one does
> quite a bit less work.  Think of it like this:
>
> You have 500,000 people to fit through a door.  Here are your options:
>
> 1.  For each person, open the door, walk through the door, then close
> the door.
> 2.  Open the door, allow everyone to walk through, then close the
> door.
>
> Which one would you say would be a more efficient way to fit 500,000
> people through the door?

Bad analogy.  A better analogy would be each person having their own
door to walk through: both methods open and close the same N doors,
so the second one doesn't do any less open/close work.

My hunch is that it has to do with OS I/O scheduling.  Closing a
file flushes its buffer, which in turn asks the I/O scheduler to
queue a write to disk; the scheduler is perhaps more efficient
(for a given number of total writes) when many writes arrive
together and can be combined.
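
If that hunch is right, forcing each write out to disk should shrink
or erase the gap, because the scheduler would no longer be free to
combine writes.  Here's a rough sketch of such a test (untested, and
only an illustration: os.fsync pushes the data through the OS cache
after fp.flush() empties Python's own buffer):

import os
import timeit

names = ['afile' + str(n) for n in range(1000)]

setup = 'import os\nfrom __main__ import names'

# Method 1, but with every write forced to disk before the close.
method1 = '''
for name in names:
    fp = open(name, 'a')
    fp.write('xyz\\n')
    fp.flush()
    os.fsync(fp.fileno())
    fp.close()
'''

# Method 2, same forced writes, grouped by procedure.
method2 = '''
files = [open(name, 'a') for name in names]
for fp in files:
    fp.write('xyz\\n')
    fp.flush()
    os.fsync(fp.fileno())
for fp in files:
    fp.close()
'''

# fsync is slow, so use far fewer iterations than the original test.
for label, stmt in [('method 1', method1), ('method 2', method2)]:
    print label, min(timeit.Timer(stmt, setup).repeat(3, 20))

If Method 2 keeps its edge even with fsync in the loop, then the
scheduling explanation is probably wrong and the difference lies
somewhere else (in the interpreter rather than the OS, say).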


Carl Banks


