Why is text file processing SO slow?
Alex
cut_me_out at hotmail.com
Wed Sep 20 17:06:22 EDT 2000
Jaime
> I have a program that processes large files ~25 megs. The algorithm
> that I use is as follows:
>
> while not EOF:
>     readline
>     store line in list
>
>     if line == 'End of set' marker
>         dump the list to appropriate file (produces 2 sep files)
>         empty the list
>     elsif line contains 'EDS' regexp
>         set the eds flag
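
For concreteness, the quoted loop could be written roughly as below. This is only a sketch: the names `split_sets`, `END_MARKER` and `EDS_RE` are made up for illustration, the exact marker text and 'EDS' pattern are guesses, and resetting the flag after each set is an assumption. What happens to each finished set (which of the two files it goes to) is left to the caller, because the original post does not say.

```python
import io
import re

# Hypothetical marker and pattern; the original post gives no exact values.
END_MARKER = 'End of set'
EDS_RE = re.compile(r'EDS')


def split_sets(lines):
    '''Yield (set_lines, eds_flag) for each set terminated by END_MARKER.'''
    current = []
    eds_flag = False
    for line in lines:
        current.append(line)
        if line.rstrip('\n') == END_MARKER:
            # A set is complete: hand it to the caller and start a new one.
            yield current, eds_flag
            current = []
            eds_flag = False  # assumption: the flag applies per set
        elif EDS_RE.search(line):
            eds_flag = True
    # Lines after the last end marker are dropped; that is a choice of
    # this sketch, not something the original post specifies.


sample = io.StringIO('a\nEDS 1\nEnd of set\nb\nEnd of set\n')
sets = list(split_sets(sample))
```

The caller would then loop over `split_sets(f)` and write each list to whichever output file is appropriate, for example with `out.writelines(set_lines)`.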
Part of the slowdown is probably caused by repeatedly reading the file a
line at a time. It's much faster to read in lots of lines at once.
I have a class I use for this:
class Lazy_file:
    '''Wrapper for file objects that takes care of reading in large
    chunks of a file at a time but allows processing a line at a time.

    Can be used in the form

        for line in Lazy_file(file_object):
            process(line)

    or in the form

        f = Lazy_file(file_object)
        while f.readline():
            process(f.line)
    '''

    def __init__(self,
                 # File object to wrap.
                 file,
                 # How often to print out the number of lines that
                 # have been processed. Value of None means don't
                 # print out numbers of lines at all.
                 print_count_period=None,
                 # How much of the file to read in each time.
                 sizehint=10**7):
        self.file = file
        # Keeps track of how many lines have been read in.
        self.line_count = 0
        self.sizehint = sizehint
        self.print_count_period = print_count_period
        # Keeps the contents of the file that have been read in, but
        # not yet processed.
        self.buffer = None
        # Whether or not to strip the trailing newline.
        self.truncate_p = None
        # Whether or not the entire file has been read.
        self.done_p = None

    def readline(self):
        '''Return the next line in the file object that has been
        wrapped.'''
        if not self.buffer:
            # Get some more of the file.
            self.buffer = self.file.readlines(self.sizehint)
            # ...it's much more efficient to pop things off the tail
            # of a list than the head, so swap them.
            self.buffer.reverse()
        if not self.buffer:
            # Nothing was returned from the file object, so must have
            # reached the end.
            self.line = ''
            self.done_p = 1
        else:
            self.line = self.buffer.pop()
            if self.truncate_p:
                # Remove the trailing newline, if there is one. (There
                # isn't necessarily at the end of the file.) (I guess
                # this isn't going to work on platforms where
                # linebreaks aren't denoted by '\n')
                if self.line[-1] == '\n':
                    self.line = self.line[:-1]
            if self.print_count_period and \
               not (self.line_count % self.print_count_period):
                print(self.line_count)
            self.line_count = self.line_count + 1
        return self.line

    def __getitem__(self, index):
        '''Return the next line in the file. Raise an IndexError if
        the file is finished, so that the class can be used in a for
        loop.'''
        if self.readline():
            return self.line
        else:
            raise IndexError('File finished')


def lazy_file(filename, print_count_period=None,
              sizehint=10**7, mode='r', open=open):
    file = open(filename, mode)
    return Lazy_file(file, print_count_period, sizehint)
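
In more recent Python versions the same idea can be expressed as a generator. This is a minimal sketch, assuming Python 3; `chunked_lines` is a hypothetical name, not part of the class above. It relies on `readlines(hint)`, which keeps reading whole lines until roughly `hint` bytes or characters have been collected.

```python
import io


def chunked_lines(file_obj, sizehint=10**7):
    '''Yield lines one at a time while reading the file in large batches.'''
    while True:
        # Read a batch of complete lines totalling roughly sizehint.
        batch = file_obj.readlines(sizehint)
        if not batch:
            # An empty list means the end of the file was reached.
            return
        for line in batch:
            yield line


# Small in-memory demonstration; a tiny sizehint forces several batches.
demo = io.StringIO('one\ntwo\nthree')
lines = list(chunked_lines(demo, sizehint=4))
```

Iterating over the batch in order avoids the reverse-and-pop step. Note that in current Python, plain `for line in file_obj:` already reads through an internal buffer, so a wrapper like this may well make little difference there; it is worth timing both before adopting it.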
--
Speak softly but carry a big carrot.