Why is text file processing SO slow?

Alex cut_me_out at hotmail.com
Wed Sep 20 17:06:22 EDT 2000


Jaime wrote:
> I have a program that processes large files ~25 megs.  The algorithm
> that I use is as follows:
> 
> while not EOF:
>     readline
>     store line in list
> 
>     if line == 'End of set' marker
>         dump the list to appropriate file (produces 2 sep files)
>         empty the list
>     elif line contains 'EDS' regexp
>         set the eds flag
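
If I'm reading that right, the line-at-a-time version of your loop
looks roughly like this (the marker string, the EDS pattern and the
two output files are guesses from your description):

import re

EDS_RE = re.compile('EDS')      # guessed pattern

def split_sets(infile, normal_out, eds_out):
    # Line-at-a-time version of the loop sketched above; infile,
    # normal_out and eds_out are already-open file objects.
    lines = []
    eds_flag = 0
    while 1:
        line = infile.readline()
        if not line:            # readline() returns '' at EOF
            break
        lines.append(line)
        if line[:10] == 'End of set':
            # Dump the accumulated set to whichever file it belongs in.
            if eds_flag:
                eds_out.writelines(lines)
            else:
                normal_out.writelines(lines)
            lines = []
            eds_flag = 0
        elif EDS_RE.search(line):
            eds_flag = 1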

Part of the slowdown is probably caused by repeatedly reading the file a
line at a time.  It's much faster to read in lots of lines at once.
I have a class I use for this:

class Lazy_file:

    '''Wrapper for file objects that takes care of reading in large
    chunks of a file at a time but allows processing a line at a time.
    Can be used in the form

    for line in Lazy_file(file_object):
        process(line)

    or in the form

    f = Lazy_file(file_object)
    while f.readline():
        process(f.line)
    '''
    
    def __init__(self,

                 # File object to wrap.
                 file,

                 # How often to print out the number of lines that
                 # have been processed.  Value of None means don't
                 # print out numbers of lines at all.
                 print_count_period=None,

                 # How much of the file to read in each time.
                 sizehint=10**7,

                 # Whether or not to strip the trailing newline from
                 # each line.
                 truncate_p=None):
        
        self.file = file

        # Keeps track of how many lines have been read in.
        self.line_count=0
        self.sizehint = sizehint
        self.print_count_period = print_count_period

        # Keeps the contents of the file that have been read in, but
        # not yet processed.
        self.buffer = None

        # Whether or not to strip the trailing newline.
        self.truncate_p = truncate_p

        # Whether or not the entire file has been read.
        self.done_p = None
        
    def readline(self):

        '''Return the next line in the file object that has been
        wrapped.'''
        
        if not self.buffer:

            # Get some more of the file.
            self.buffer = self.file.readlines(self.sizehint)
            
            # ...it's much more efficient to pop things off the tail
            # of a list than the head, so swap them.
            self.buffer.reverse() 

        if not self.buffer:

            # Nothing was returned from the file object, so must have
            # reached the end.
            self.line = ''
            self.done_p = 1
        else:
            
            self.line = self.buffer.pop()
            if self.truncate_p:

                # Remove the trailing newline, if there is one.  (There
                # isn't necessarily one at the end of the file, and I
                # guess this isn't going to work on platforms where
                # line breaks aren't denoted by '\n'.)
                if self.line[-1] == '\n':
                    self.line = self.line[: -1]

            self.line_count = self.line_count + 1
            if self.print_count_period and \
               not (self.line_count % self.print_count_period):
                print self.line_count

        return self.line
    
    def __getitem__(self, index):

        '''Return the next line in the file.  Raise an IndexError if
        the file is finished, so that the class can be used in a for
        loop.'''
        
        if self.readline():
            return self.line
        else:
            raise IndexError, 'File finished'

def lazy_file(filename, print_count_period=None,
               sizehint=10**7, mode='r', open=open):
    file = open(filename, mode)
    return Lazy_file(file, print_count_period, sizehint)
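
For your problem, the same splitting loop written on top of the
Lazy_file wrapper above would look something like this (the
filenames, the marker string and the EDS pattern are again just
guesses):

import re

eds_re = re.compile('EDS')                  # guessed pattern

normal_out = open('normal_sets.out', 'w')   # made-up filenames
eds_out = open('eds_sets.out', 'w')

lines = []
eds_flag = 0
for line in lazy_file('bigfile.dat', print_count_period=100000):
    lines.append(line)
    if line[:10] == 'End of set':
        if eds_flag:
            eds_out.writelines(lines)
        else:
            normal_out.writelines(lines)
        lines = []
        eds_flag = 0
    elif eds_re.search(line):
        eds_flag = 1

normal_out.close()
eds_out.close()

The only real change from the line-at-a-time version is that the
lines now come out of a big readlines() buffer instead of a separate
readline() call each; the splitting logic stays the same.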

-- 
Speak softly but carry a big carrot.


