xreadlines (was Re: while true: !!!)
Alex Martelli
aleaxit at yahoo.com
Fri Dec 15 07:58:25 EST 2000
"Neelakantan Krishnaswami" <neelk at alum.mit.edu> wrote in message
news:slrn93jagg.fl.neelk at alum.mit.edu...
> On Thu, 14 Dec 2000 11:14:38 +0100, Alex Martelli <aleaxit at yahoo.com> wrote:
> >> stdin. I've used fileinput to go through big lists of files (10,000+ email
> >> messages) and it works great. It doesn't appear to do any buffering
> >> itself--it uses file.readline() to read the files.
> >
> > If this was a performance problem, it could of course also be fixed
> > in a future fileinput version without changing code that uses it (again,
> > in-place-rewriting would probably have to inhibit the optimization,
> > although that isn't entirely clear).
>
> While it's true that fileinput is somewhat slow, it can easily be
> made faster than the usual while 1: loop everyone uses. (Relatively
Yep, exactly my point -- and the chunking of readlines was exactly
what I had in mind here as the performance fix... thanks for doing
the actual work, which I lazily skipped! With a larger buffer and
a more streamlined __getitem__ (no error-checking, optimized for
the most frequent case) I can get to roughly half the speed of a
plain readlines(), i.e. about twice the elapsed time:
class LinesOf:
    def __init__(self, file, chunkSize=256*1024):
        self.file = file
        self.chunkSize = chunkSize
        self.start = 0
        self.refill()
    def refill(self):
        self.data = self.file.readlines(self.chunkSize)
    def __getitem__(self, i):
        try: return self.data[i-self.start]
        except IndexError:
            self.start += len(self.data)
            self.refill()
            if not self.data: raise IndexError
            return self.data[i-self.start]

import time

def withReadlines(file):
    start = time.clock()
    i = 0
    bytes = 0
    for line in file.readlines():
        #i+=1
        #bytes+=len(line)
        pass
    stend = time.clock()
    return i, bytes, stend-start

def withLinesOf(file):
    start = time.clock()
    i = 0
    bytes = 0
    for line in LinesOf(file):
        #i+=1
        #bytes+=len(line)
        pass
    stend = time.clock()
    return i, bytes, stend-start

def test(filename):
    file=open(filename)
    print withReadlines(file)
    file.close()
    file=open(filename)
    print withLinesOf(file)
    file.close()

if __name__=='__main__':
    import sys
    try: filename = sys.argv[1]
    except IndexError: filename = 'aaa.py'
    test(filename)

The operations in the for-loops are commented out to ensure
we're not timing them too -- uncommenting them helps confirm
that LinesOf is actually working, and gives an idea of the
magnitude of the test:
D:\PySym>python aaa.py \winnt\profiles\martelli\personal\findin~1.htm
(6072, 444748, 0.075308712333910496)
(6072, 444748, 0.14522679691782142)
The test file is the 'Findings of Fact' HTML file from Microsoft's
antitrust case -- at about 445,000 bytes it is well over the
256*1024 chunkSize, which ensures that refilling is exercised.
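
The same chunking idea can also be checked without a large file on disk. What follows is only a sketch, not the code timed in this message: it is written for current Python 3, the generator name lines_of is made up for illustration, and it uses an in-memory io.StringIO with a deliberately tiny chunk size so that many refills happen.

```python
import io

def lines_of(f, chunk_size=256 * 1024):
    """Yield lines from f, fetching them in batches with readlines(hint).

    Hypothetical generator counterpart of the LinesOf class; the name
    and the default chunk size are assumptions for this sketch.
    """
    while True:
        # readlines(hint) returns whole lines, stopping once their total
        # size reaches roughly chunk_size.
        chunk = f.readlines(chunk_size)
        if not chunk:  # an empty list signals end of file
            return
        for line in chunk:
            yield line

# A tiny chunk size forces many refills even on a small in-memory file.
text = "".join("line %d\n" % n for n in range(1000))
chunked = list(lines_of(io.StringIO(text), chunk_size=64))
plain = io.StringIO(text).readlines()

print(len(chunked), chunked == plain)
```

If the refill logic dropped or duplicated lines at chunk boundaries, the two lists would differ, which is the same check that uncommenting the counters performs on the real file.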
OK, without the operations in the loop, we have:
D:\PySym>python aaa.py \winnt\profiles\martelli\personal\findin~1.htm
(0, 0, 0.052779039576527305)
(0, 0, 0.10314101285470281)
D:\PySym>python aaa.py \winnt\profiles\martelli\personal\findin~1.htm
(0, 0, 0.052812563380942722)
(0, 0, 0.1092364785925366)
Best to run it twice to guard against cache effects -- the first time
I ran it, LinesOf appeared *FASTER*... because it runs _after_
readlines, so it benefited from OS caching of the file!-).
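
One way to take the cache effect out of the comparison is to read the file once untimed and then keep the fastest of several timed runs. The sketch below assumes current Python 3, where time.clock no longer exists and time.perf_counter is used instead; the helper names (time_readlines, time_chunked, best_of) and the temporary sample file are made up for illustration and stand in for the real test file.

```python
import os
import tempfile
import time

def time_readlines(path):
    """Time one pass over the file using a single readlines() call."""
    with open(path) as f:
        start = time.perf_counter()
        for line in f.readlines():
            pass
        return time.perf_counter() - start

def time_chunked(path, chunk_size=256 * 1024):
    """Time one pass using repeated readlines(chunk_size) calls."""
    with open(path) as f:
        start = time.perf_counter()
        while True:
            chunk = f.readlines(chunk_size)
            if not chunk:
                break
            for line in chunk:
                pass
        return time.perf_counter() - start

def best_of(timer, path, repeats=5):
    """Warm the OS cache with one untimed read, then keep the fastest run."""
    with open(path, "rb") as f:
        f.read()
    return min(timer(path) for _ in range(repeats))

# Hypothetical stand-in for the real test file: a temporary file of short lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("some line of text\n" * 50000)
    sample_path = tmp.name

try:
    results = {
        "readlines": best_of(time_readlines, sample_path),
        "chunked": best_of(time_chunked, sample_path),
    }
finally:
    os.remove(sample_path)

for name, seconds in results.items():
    print("%-10s %.6f s" % (name, seconds))
```

With the warm-up read in place, the order in which the two variants are timed should matter much less, since neither pays for the first trip to disk.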
Alex