FileInput too slow

Terry Reedy tjreedy at udel.edu
Mon Jan 4 22:30:52 EST 2010


On 1/4/2010 5:35 PM, wiso wrote:
> I'm trying the fileinput module, and I like it, but I don't understand why
> it's so slow... look:
>
> from time import time
> from fileinput import FileInput
>
> file = ['r1_200907.log', 'r1_200908.log', 'r1_200909.log', 'r1_200910.log',
> 'r1_200911.log']
>
> def f1():
>    n = 0
>    for f in file:
>      print "new file: %s" % f
>      ff = open(f)
>      for line in ff:
>        n += 1
>      ff.close()
>    return n
>
> def f2():
>    f = FileInput(file)
>    for line in f:
>      if f.isfirstline(): print "new file: %s" % f.filename()
>    return f.lineno()
>
> def f3(): # f2 simpler
>    f = FileInput(file)
>    for line in f:
>      pass
>    return f.lineno()
>
>
> t = time(); f1(); print time()-t # 1.0
> t = time(); f2(); print time()-t # 7.0 !!!
> t = time(); f3(); print time()-t # 5.5
>
>
> I'm using text files, there are 2563150 lines in total.

1. Timings should include platform and Python version.
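
For instance, a timing report could carry that information along with the numbers. This is only a sketch for a current Python 3; the helper names describe_environment and timed are made up for illustration and are not part of any library.

```python
import platform
import time


def describe_environment():
    """Return a one-line summary of the interpreter and the platform."""
    return "Python %s (%s) on %s" % (
        platform.python_version(),
        platform.python_implementation(),
        platform.platform(),
    )


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Print the environment first, so every timing below it can be interpreted.
print(describe_environment())
total, elapsed = timed(sum, range(100000))
print("sum(range(100000)) took %.6f s" % elapsed)
```

Posting the first line of that output next to the timings lets readers compare like with like.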

2. fileinput executes a lot of Python code on top of the underlying file 
methods.

The single n += 1 in your f1 does not come close to matching that overhead, 
so f1 and f2/f3 are not doing comparable work per line.

fileinput does at least the following for each line:

         try:
             line = self._buffer[self._bufindex]
         except IndexError:
             pass
         else:
             self._bufindex += 1
             self._lineno += 1
             self._filelineno += 1

That is 5 attribute accesses, an indexing operation, and 3 additions per line.
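
One way to see the cost on your own machine is to time both approaches on the same data. The following is a rough sketch for a current Python 3, not a rigorous benchmark: the helper names (make_sample, count_plain, count_fileinput, best_time) and the 200000-line sample file are assumptions for illustration, and the fileinput internals quoted above may differ in other releases.

```python
import fileinput
import os
import tempfile
import time


def make_sample(n_lines):
    """Write n_lines short lines to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".log")
    with os.fdopen(fd, "w") as out:
        for i in range(n_lines):
            out.write("line %d\n" % i)
    return path


def count_plain(paths):
    """Count lines by iterating over each file object directly."""
    n = 0
    for path in paths:
        with open(path) as handle:
            for _ in handle:
                n += 1
    return n


def count_fileinput(paths):
    """Count lines by iterating through fileinput.FileInput."""
    reader = fileinput.FileInput(files=paths)
    try:
        for _ in reader:
            pass
        return reader.lineno()
    finally:
        reader.close()


def best_time(func, paths, repeat=3):
    """Return the fastest wall-clock time of several calls to func(paths)."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(paths)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    sample = make_sample(200000)
    try:
        print("plain iteration: %.3f s" % best_time(count_plain, [sample]))
        print("fileinput:       %.3f s" % best_time(count_fileinput, [sample]))
    finally:
        os.remove(sample)
```

Taking the best of several runs reduces noise from disk caching; the absolute numbers will depend on the platform and Python version.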

3. You are welcome to read the Python source in 
.../pythonxy/Lib/fileinput.py

4. The docstring for the 3.1 version says:
  "Performance: this module is unfortunately one of the slower ways of
processing large numbers of input lines.  Nevertheless, a significant
speed-up has been obtained by using readlines(bufsize) instead of
readline().  A new keyword argument, bufsize=N, is present on the
input() function and the FileInput() class to override the default
buffer size."

If your version has bufsize, try something larger than the default of 
8*1024, say 1024*1024.
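
Whether bufsize is available depends on the Python version; as far as I know, later Python 3 releases deprecated and then dropped the parameter. A hedged sketch that passes it only when the installed FileInput accepts it might look like this. It is written for a current Python 3, and supports_bufsize and open_fileinput are hypothetical helper names, not part of the fileinput module.

```python
import fileinput
import inspect
import os
import tempfile


def supports_bufsize():
    """Return True if this Python's FileInput accepts a bufsize argument."""
    params = inspect.signature(fileinput.FileInput.__init__).parameters
    return "bufsize" in params


def open_fileinput(paths, bufsize=1024 * 1024):
    """Return a FileInput over paths, passing bufsize only where supported."""
    if supports_bufsize():
        return fileinput.FileInput(files=paths, bufsize=bufsize)
    return fileinput.FileInput(files=paths)


# Small demonstration on a throwaway three-line file.
fd, path = tempfile.mkstemp(suffix=".log")
with os.fdopen(fd, "w") as out:
    out.write("a\nb\nc\n")

reader = open_fileinput([path])
try:
    count = sum(1 for _ in reader)
finally:
    reader.close()
    os.remove(path)

print("bufsize supported:", supports_bufsize())
print("lines read:", count)
```

On a version without bufsize the helper simply falls back to the default behaviour, so the calling code stays the same.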

Terry Jan Reedy



