FileInput too slow
Terry Reedy
tjreedy at udel.edu
Mon Jan 4 22:30:52 EST 2010
On 1/4/2010 5:35 PM, wiso wrote:
> I'm trying the fileinput module, and I like it, but I don't understand why
> it's so slow... look:
>
> from time import time
> from fileinput import FileInput
>
> file = ['r1_200907.log', 'r1_200908.log', 'r1_200909.log', 'r1_200910.log',
>         'r1_200911.log']
>
> def f1():
>     n = 0
>     for f in file:
>         print "new file: %s" % f
>         ff = open(f)
>         for line in ff:
>             n += 1
>         ff.close()
>     return n
>
> def f2():
>     f = FileInput(file)
>     for line in f:
>         if f.isfirstline(): print "new file: %s" % f.filename()
>     return f.lineno()
>
> def f3(): # f2 simpler
>     f = FileInput(file)
>     for line in f:
>         pass
>     return f.lineno()
>
>
> t = time(); f1(); print time()-t # 1.0
> t = time(); f2(); print time()-t # 7.0 !!!
> t = time(); f3(); print time()-t # 5.5
>
>
> I'm using text files, there are 2563150 lines in total.
1. Timings should include platform and Python version.
2. fileinput executes a lot of Python code on top of the underlying file
methods. The n += 1 in your f1 does not come close to matching that overhead.
fileinput does at least the following for each line:
    try:
        line = self._buffer[self._bufindex]
    except IndexError:
        pass
    else:
        self._bufindex += 1
        self._lineno += 1
        self._filelineno += 1
That is 5 attribute accesses, an indexing, and 3 additions.
3. You are welcome to read the Python source in
.../pythonxy/Lib/fileinput.py
4. The docstring for the 3.1 version says:
"Performance: this module is unfortunately one of the slower ways of
processing large numbers of input lines. Nevertheless, a significant
speed-up has been obtained by using readlines(bufsize) instead of
readline(). A new keyword argument, bufsize=N, is present on the
input() function and the FileInput() class to override the default
buffer size."
If your version has bufsize, try something larger than the default of
8*1024, say 1024*1024.
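
To make points 1 and 2 concrete, here is a rough sketch of a timing
comparison that also prints the platform and Python version. It is written
for a current Python 3, not the 2.x of the quoted code, and everything in it
is an assumption for illustration: the helper names (make_files, count_plain,
count_fileinput, timed), the sample_N.log file names, and the file sizes are
all made up. It generates throwaway files rather than reading the r1_*.log
files, so the absolute numbers will not match the ones quoted above.

```python
# Python 3 sketch; file names and sizes are hypothetical.
import fileinput
import os
import platform
import sys
import tempfile
from time import perf_counter


def make_files(directory, n_files=3, n_lines=100_000):
    """Write n_files text files of n_lines each and return their paths."""
    paths = []
    for i in range(n_files):
        path = os.path.join(directory, "sample_%d.log" % i)
        with open(path, "w") as out:
            for j in range(n_lines):
                out.write("line %d\n" % j)
        paths.append(path)
    return paths


def count_plain(paths):
    """Count lines by iterating over each file object directly (like f1)."""
    n = 0
    for path in paths:
        with open(path) as ff:
            for line in ff:
                n += 1
    return n


def count_fileinput(paths):
    """Count lines through FileInput and its per-line bookkeeping (like f3)."""
    with fileinput.FileInput(files=paths) as f:
        for line in f:
            pass
        # lineno() is the cumulative line number across all files read so far.
        return f.lineno()


def timed(func, paths):
    """Return (result, elapsed seconds) for a single call of func(paths)."""
    start = perf_counter()
    result = func(paths)
    return result, perf_counter() - start


def main():
    # Point 1: report where the numbers come from.
    print("Python %s on %s" % (sys.version.split()[0], platform.platform()))
    with tempfile.TemporaryDirectory() as directory:
        paths = make_files(directory)
        for func in (count_plain, count_fileinput):
            n, elapsed = timed(func, paths)
            print("%-16s %8d lines  %.3f s" % (func.__name__, n, elapsed))


if __name__ == "__main__":
    main()
```

Both functions should report the same line count; only the elapsed times
differ, and by how much depends on the Python version and platform. The
sketch does not pass bufsize, since recent Python 3 releases no longer
accept that argument.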
Terry Jan Reedy