Unexpected timing results with file I/O

rdahlstrom roger.dahlstrom at gmail.com
Mon Feb 4 13:18:39 EST 2008


On Feb 4, 1:12 pm, Carl Banks <pavlovevide... at gmail.com> wrote:
> On Feb 4, 12:53 pm, rdahlstrom <roger.dahlst... at gmail.com> wrote:
>
>
>
> > On Feb 4, 10:17 am, Steven D'Aprano <st... at REMOVE-THIS-cybersource.com.au> wrote:
> > > After reading an earlier thread about opening and closing lots of files,
> > > I thought I'd do a little experiment.
>
> > > Suppose you have a whole lot of files, and you need to open each one,
> > > append a string, then close them. There's two obvious ways to do it:
> > > group your code by file, or group your code by procedure.
>
> > > # Method one: grouped by file.
> > > for each file:
> > >     open the file, append the string, then close it
>
> > > # Method two: grouped by procedure.
> > > for each file:
> > >     open the file
> > > for each open file:
> > >     append the string
> > > for each open file:
> > >     close the file
>
> > > If you have N files, both methods make the same number of I/O calls: N
> > > opens, N writes, N closes. Which is faster?
>
> > > Intuitively, the first method has *got* to be faster, right? It's got one
> > > loop instead of three and it doesn't build an intermediate list of open
> > > file objects. It's so *obviously* going to be faster that it is hardly
> > > worth bothering to check it with timeit, right?
>
> > > Well, I wouldn't be writing unless that intuitive result was wrong. So
> > > here's my test results:
>
> > > Method 1:
>
> > > >>> import timeit
> > > >>> names = ['afile' + str(n) for n in range(1000)]
> > > >>> T = timeit.Timer('''for name in names:
>
> > > ...     fp = open(name, 'a'); fp.write('xyz\\n'); fp.close()
> > > ... ''', 'from __main__ import names')
> > > >>> min(T.repeat(6, 500))
>
> > > 17.391216039657593
>
> > > Method 2:
>
> > > >>> for name in names:  # reset the files to an empty state.
>
> > > ...     fp = open(name, 'w'); fp.close()
> > > ...
> > > >>> T = timeit.Timer('''files = [open(name, 'a') for name in names]
>
> > > ... for fp in files:
> > > ...     fp.write('xyz\\n')
> > > ... for fp in files:
> > > ...     fp.close()
> > > ... ''', '''from __main__ import names''')
> > > >>> min(T.repeat(6, 500))
>
> > > 16.823362112045288
>
> > > Surprisingly, Method 2 is a smidgen faster, by about half a second over
> > > 500,000 open-write-close cycles. It's not much faster, but it's
> > > consistent, over many tests, changing many of the parameters (e.g. the
> > > number of files, the number of runs per timeit test, etc.).
>
> > > I'm using Linux and Python 2.5.
>
> > > So, what's going on? Can anyone explain why the code which does more work
> > > takes less time?
>
> > > --
> > > Steven
>
> > The code that does more work takes more time.  The second one does
> > quite a bit less work.  Think of it like this:
>
> > You have 500,000 people to fit through a door.  Here are your options:
>
> > 1.  For each person, open the door, walk through the door, then close
> > the door.
> > 2.  Open the door, allow everyone to walk through, then close the
> > door.
>
> > Which one would you say would be a more efficient way to fit 500,000
> > people through the door?
>
> Bad analogy.  A better analogy would be if each person has their own
> door to walk through.
>
> My hunch is that it has to do with OS I/O scheduling.  Closing a
> file triggers a cache flush, which in turn triggers the I/O to
> schedule a write to disk; the OS scheduler is perhaps more efficient
> (for a given number of total writes) when it can combine many writes
> at the same time.
>
> Carl Banks

The analogy holds.  It's faster to open the door once, do everything you
need to do, and then close it than it is to open and close the door every
single time.
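
For anyone who wants to poke at this without retyping the interactive
session, here's a rough self-contained sketch of the comparison.  It's my
own reconstruction, not Steven's exact script: the file names and counts
are made up, and it assumes Python 2.6 or newer so that timeit.Timer can
take a callable directly.  The fsync variants force each file's data to
disk before closing, which is one way to check whether the deferred flush
on close that Carl mentions is where the difference comes from.

import os
import timeit

# Scratch files for the test.  The names and count are made up for
# illustration; keep the count below the OS open-file limit, since the
# "grouped by procedure" version holds every file open at once.
names = ['afile%d' % n for n in range(1000)]
for name in names:
    open(name, 'w').close()   # start each file empty

def fsync_close(fp):
    # Push the file's data all the way to disk, then close it.
    fp.flush()
    os.fsync(fp.fileno())
    fp.close()

def method1(close=lambda fp: fp.close()):
    # Grouped by file: open, append, close, one file at a time.
    for name in names:
        fp = open(name, 'a')
        fp.write('xyz\n')
        close(fp)

def method2(close=lambda fp: fp.close()):
    # Grouped by procedure: open everything, write everything, close everything.
    files = [open(name, 'a') for name in names]
    for fp in files:
        fp.write('xyz\n')
    for fp in files:
        close(fp)

tests = [
    ('method1',       method1),
    ('method2',       method2),
    # fsync variants: the close no longer gets to defer any of the write cost.
    ('method1+fsync', lambda: method1(close=fsync_close)),
    ('method2+fsync', lambda: method2(close=fsync_close)),
]

# Far fewer iterations than Steven's 6 x 500, because the fsync runs are slow.
for label, func in tests:
    best = min(timeit.Timer(func).repeat(3, 20))
    print('%s: %.3f seconds' % (label, best))

If forcing the data to disk makes the two groupings come out about even,
that would point at write scheduling rather than the loop structure; if
method 2 still wins, the door analogy looks that much better.  Either way,
treat it as a sketch rather than a proper benchmark; the filesystem, the
disk, and the OS cache all matter.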


