Reading variable length records...

Alex Martelli aleax at aleax.it
Thu Sep 13 09:40:14 EDT 2001


"Bjorn Pettersen" <BPettersen at NAREX.com> wrote in message
news:mailman.1000333216.26995.python-list at python.org...
"""
I'm trying to read records from a 2 GB datafile, but my brain has
stopped working, so I was wondering if someone has already solved this
problem. The records are variable length and are separated by a five
character delimiter. I was trying to use file.read(n) with a blocksize
of ~1Mb, but got a serious brainfart when trying to think of how to
handle the case where only part of the delimiter was read in the current
block.
"""
Net of obvious tuning, you could try:

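delimiter = b'aeiou'          # the five-character record separator
bufsize = 1024*1024           # ~1 MB per read, as in the question
# "rb": it's a data file, and text mode could mangle it on Windows
fileob = open("thebigfile.dat", "rb")
buffer = fileob.read(bufsize)
record_start = 0
while True:
    record_end = buffer.find(delimiter, record_start)
    if record_end >= record_start:
        # a complete record is in the buffer; process_record is
        # whatever function you supply to handle one record
        process_record(buffer[record_start:record_end])
        record_start = record_end+len(delimiter)
    else:
        # no delimiter left: keep the unprocessed tail (it may end
        # with part of a delimiter) and top the buffer up to bufsize
        buffer = buffer[record_start:]
        newdata = fileob.read(bufsize-len(buffer))
        if not newdata:
            # end of file: what remains is the last record
            trailing_record(buffer)
            break
        buffer += newdata
        record_start = 0
fileob.close()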

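This assumes that a full record, delimiter included, will
always be found within any given continuous bufsize-sized
chunk -- i.e. no record is longer than bufsize-len(delimiter).
If a longer record does turn up, the buffer is already full,
fileob.read(0) returns nothing, and the loop mistakes that
for end of file.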
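
If records longer than bufsize can occur, one way around the
assumption is to always read a full block and let the buffer
grow until a delimiter shows up.  What follows is only a sketch
under my own assumptions: it targets Python 3 with a file opened
in binary mode, the name iter_records is made up for
illustration, and an empty trailing record is skipped rather
than reported.

```python
import io

def iter_records(fileob, delimiter, bufsize=1024 * 1024):
    """Yield delimiter-separated records; a record may exceed bufsize.

    Hypothetical helper, not part of the original post.
    """
    buffer = b""
    while True:
        newdata = fileob.read(bufsize)
        if not newdata:
            break                      # end of file
        buffer += newdata              # buffer grows until a delimiter appears
        record_start = 0
        while True:
            record_end = buffer.find(delimiter, record_start)
            if record_end < 0:
                break                  # no complete record left in the buffer
            yield buffer[record_start:record_end]
            record_start = record_end + len(delimiter)
        # keep the unprocessed tail; it may end with part of a delimiter
        buffer = buffer[record_start:]
    if buffer:
        yield buffer                   # trailing record without a delimiter

# Tiny in-memory demo: bufsize=4 is smaller than the delimiter itself,
# so every delimiter straddles a block boundary.
demo = io.BytesIO(b"oneaeioutwoaeiouthree")
print(list(iter_records(demo, b"aeiou", bufsize=4)))
# prints [b'one', b'two', b'three']
```

The price is that a very long record gets rescanned from its
start after every read, so this trades some speed for not
having to know the maximum record length in advance.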

It's also untested code, as well as unoptimized.  For example,
the search for the next delimiter need not restart from the
beginning of the buffer each time: after reading new data
we might as well resume the search len(delimiter)-1
characters before the new data, as a delimiter cannot
start any earlier than that.  But if it's fast enough,
such small extras may not matter at all:-).
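
The resume-the-search idea could look roughly like this.  Again
a sketch rather than tested production code: it assumes
Python 3 and a binary file object, and the names
iter_records_resume and search_from are mine, chosen only for
illustration.

```python
import io

def iter_records_resume(fileob, delimiter, bufsize=1024 * 1024):
    """Yield records, never rescanning bytes already known to be delimiter-free.

    Hypothetical helper, not part of the original post.
    """
    overlap = len(delimiter) - 1       # longest partial delimiter a block can end with
    buffer = b""
    search_from = 0                    # no delimiter starts before this index
    while True:
        newdata = fileob.read(bufsize)
        if not newdata:
            break                      # end of file
        buffer += newdata
        record_start = 0
        while True:
            record_end = buffer.find(delimiter, search_from)
            if record_end < 0:
                break
            yield buffer[record_start:record_end]
            record_start = search_from = record_end + len(delimiter)
        # A delimiter completed by later data can start no earlier than
        # `overlap` bytes before the end of what we hold now.
        search_from = max(search_from, len(buffer) - overlap)
        # Drop the consumed records and shift the search index to match.
        buffer = buffer[record_start:]
        search_from -= record_start
    if buffer:
        yield buffer                   # trailing record without a delimiter

# Demo: a record much longer than the 3-byte block size.
demo = io.BytesIO(b"abcdefghijXYrest")
print(list(iter_records_resume(demo, b"XY", bufsize=3)))
# prints [b'abcdefghij', b'rest']
```

Whether the saving is measurable depends on how long records
are relative to the block size; with short records the plain
version does almost no redundant searching anyway.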


Alex





