Lazy file.readlines()?

Neil Schemenauer nascheme at ucalgary.ca
Wed Sep 15 19:21:02 EDT 1999


Fredrik Lundh <fredrik at pythonware.com> wrote:
>Hrvoje Niksic <hniksic at srce.hr> wrote:
>> > - reading a file one line at a time (self.__fp.readline())
>> 
>> I don't see an alternative to this, except to read the whole file at
>> once, which I am trying to avoid, as the files are large.
>
>note that:
>
>    while 1:
>        lines = fp.readlines(16384)
>        if not lines:
>            break
>        for line in lines:
>            ...
>
>is usually much faster than
>
>    while 1:
>        line = fp.readline()
>        if not line:
>            break
>        ...

What would be cool is if readlines() returned a lazy sequence
object (i.e. one that reads only as much as is needed, using a
certain block size).  This should give the speed advantage of
readlines() without the concern about sucking up a huge file all
at once.
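
For comparison, here is a rough sketch of the same idea in present-day
Python 3, written as a generator.  It is not the module posted below:
the name lazy_lines and its parameters are made up for illustration,
and like the module it returns lines without the separator.

```python
import io

def lazy_lines(fileobj, blocksize=40 * 1024, sep="\n"):
    """Yield lines (without sep) from fileobj, one block at a time.

    Hypothetical helper: only blocksize characters are read per call
    to fileobj.read(), so the whole file is never held in memory.
    """
    pending = ""  # partial line carried over from the previous block
    while True:
        data = fileobj.read(blocksize)
        if not data:
            break
        pieces = (pending + data).split(sep)
        # The last piece may be an incomplete line (or "" if the block
        # ended exactly on a separator); hold it back for the next block.
        pending = pieces.pop()
        for piece in pieces:
            yield piece
    if pending:
        yield pending  # final line that had no trailing separator

# Usage with an in-memory file and a deliberately tiny block size,
# so that lines straddle block boundaries.
sample = io.StringIO("alpha\nbeta\n\ngamma")
result = list(lazy_lines(sample, blocksize=4))
print(result)  # ['alpha', 'beta', '', 'gamma']
```

The tiny block size in the usage lines is only there to exercise the
carry-over of partial lines; a real block size would be much larger.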

I implemented this idea (probably badly) in pure Python
and got about a 2x speedup versus readline().  It is a small
module so I will post it here.


    being-lazy-has-its-advantages'ly Neil


import string

class BlockFile:
    def __init__(self, file, blocksize=1024*40, sep='\n'):
        self.file = file
        self.blocksize = blocksize
        self.sep = sep
        self.line = -1
        self.lines = []
        self.end = ''

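    def __getitem__(self, i):
        # The index is ignored: only the sequential access done by a
        # for loop (0, 1, 2, ... until IndexError) is supported.
        try:
            self.line = self.line+1
            return self.lines[self.line]
        except IndexError:
            self.line = 0
            self._get_block()
            return self.lines[0]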

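    def _get_block(self):
        # Keep reading until at least one complete line is available.
        while 1:
            data = self.file.read(self.blocksize)
            if len(data) == 0:
                # EOF: hand out any leftover partial line, then stop.
                if not self.end:
                    raise IndexError
                self.lines = [self.end]
                self.end = ''
                return
            # Prepend the partial line carried over from the last block.
            self.lines = string.split(self.end + data, self.sep)
            # The last piece may be incomplete (or '' if data ended on
            # sep); hold it back until the next block arrives.
            self.end = self.lines[-1]
            del self.lines[-1] # this _should_ be fast
            if self.lines:
                return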

def test_block(input):
    for l in BlockFile(open(input)):
        pass

def test_normal(input):
    f = open(input)
    while 1:
        l = f.readline()
        if not l:
            break
    

def measure(function, *args):
    import time
    # Untimed warm-up call so each test is timed with a warm file cache.
    apply(function, args)
    t1 = time.time()
    apply(function, args)
    t2 = time.time()
    return t2-t1

if __name__ == '__main__':
    import sys
    print 'block time', measure(test_block, sys.argv[1])
    print 'normal time', measure(test_normal, sys.argv[1])



