[Tutor] Read-ahead for large fixed-width binary files?

Marc Tompkins marc.tompkins at gmail.com
Sun Nov 18 04:10:10 CET 2007


Alan Gauld wrote:

"Marc Tompkins" <marc.tompkins at gmail.com> wrote
> realized I can implement this myself, using 'read(bigsize)' -
> currently I'm using 'read(recordsize)'; I just need to add an extra
> loop around my record reads.  Please disregard...
If you just want to navigate to a specific record then it might be
easier to use seek(); that will save you having to read all the
previous records into memory.


No, I need to parse the entire file, checking records as I go.  Here's the
solution I came up with - I'm sure it could be optimized, but it's already
about six times faster than going record-by-record:

import StringIO

def loadInsurance(self):
    header = ('Code', 'Name')
    Global.Ins.append(header)
    obj = Insurance()                       # bare instance, just for metadata
    recLen = obj.RecordLength
    for offNum, offPath in Global.offices.iteritems():
        if offPath.Ref == '':
            offPath.Ref = offPath.Default
        with open(offPath.Ref + obj.TLA + '.dat', 'rb') as inFile:
            inFile.read(recLen)             # throw away the header record
            tmpIn = inFile.read(recLen * 4096)      # fill the read-ahead buffer
            while len(tmpIn) >= recLen:
                buf = StringIO.StringIO(tmpIn)
                inRec = buf.read(recLen)
                while len(inRec) >= recLen:
                    obj = Insurance(inRec)
                    if obj.Valid:
                        # append one (ID, Name) tuple - append() takes a
                        # single argument
                        Global.Ins.append((obj.ID, obj.Name))
                    inRec = buf.read(recLen)
                buf.close()
                tmpIn = inFile.read(recLen * 4096)  # refill the buffer

Obviously this is taken out of context, and I'm afraid I'm too lazy to
sanitize it (much) for posting right now, so here's a brief summary instead.

1-  I don't want my calling code to need to know many details.  So if I
create an object with no parameters, it provides me with the record length
(record sizes vary from 80 bytes up to 1024, depending on the file) and the
TLA portion of the filename.  (The data files are named in the format
xxTLA.dat, where xx is the 2-digit office number and TLA is a three-letter
acronym for what the file contains - e.g. INS for insurance.)
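
To give you an idea, here's a stripped-down sketch of what one of these
classes might look like - the field layout is made up for the example, and
only RecordLength and TLA actually matter to the loop above:

class Insurance(object):
    RecordLength = 128          # bytes per record - each class knows its own
    TLA = 'INS'                 # three-letter acronym used in the filename

    def __init__(self, record=None):
        self.Valid = False
        if record is not None:
            # hypothetical layout: ID in bytes 0-8, name in bytes 8-48;
            # the real offsets depend on the file format
            self.ID = record[0:8].strip()
            self.Name = record[8:48].strip()
            self.Valid = (self.ID != '')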

2-  Using the information I just obtained, I then read through the file one
record-length chunk at a time, creating an object out of each chunk and
reading the attributes of that object.  In the next version of my class
library, I'll move the whole list-generation logic inside the classes so I
can just pass in a filename and receive a list... but that's one for my
copious free time.
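
For what it's worth, a rough sketch of that pass-in-a-filename idea - a
hypothetical helper, not part of the current library - might look like this:

def recordsFromFile(recordClass, path, bufRecords=4096):
    # Yield one valid object per fixed-width record, skipping the
    # garbage header record and keeping the bulk-read behaviour.
    recLen = recordClass.RecordLength
    with open(path, 'rb') as inFile:
        inFile.read(recLen)                    # discard the header record
        while True:
            chunk = inFile.read(recLen * bufRecords)
            if len(chunk) < recLen:
                break                          # end of file
            usable = len(chunk) - (len(chunk) % recLen)
            for i in xrange(0, usable, recLen):
                obj = recordClass(chunk[i:i + recLen])
                if obj.Valid:
                    yield obj

The caller would then just do something like:

    rows = [(o.ID, o.Name) for o in recordsFromFile(Insurance, '01INS.dat')]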

3-  Each file contains a header record, which is pure garbage.  I read it in
and throw it away before I even begin.  (I could seek to just past it
instead - would it really be more efficient?)
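
For comparison, the seek() version is a one-liner; since this is a single
record-sized read at the very start of the file, I'd expect the difference
to be negligible either way:

    inFile.seek(recLen)     # skip the header without copying it into Python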

4-  Now here's where the read-ahead buffer comes in - I (attempt to) read
4096 records' worth of data, and store it in a StringIO file-like object.
(4096 is just a number I pulled out of the air, but I've tried increasing
and decreasing it, and it seems good.  If I have the time, I may benchmark
to find the best number for each record length, and retrieve that number
along with the record length and TLA.  Of course, the optimal number
probably varies per machine, so maybe I won't bother.)
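
If I ever do run that benchmark, a crude harness - hypothetical, just timing
raw reads at different buffer sizes - could be as simple as:

import time

def timeBufferSizes(path, recLen, sizes=(256, 1024, 4096, 16384)):
    # Time one full pass of raw reads for each candidate buffer size.
    for bufRecords in sizes:
        start = time.time()
        with open(path, 'rb') as inFile:
            inFile.read(recLen)                # skip the header record
            while len(inFile.read(recLen * bufRecords)) >= recLen:
                pass
        print bufRecords, 'records/read:', time.time() - start, 'seconds'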

5-  Now I go through the buffer, one record's worth at a time, and do
whatever I'm doing with the records - in this case, I'm making a list of
insurance company IDs and names to display in a wx.CheckListCtrl.

6-  If I try to read past the end of the file, there's no error - so I need
to check the size of what's returned.  If it's smaller than recLen, I know
I've hit the end.
 6a- When I hit the end of the buffer, I close it and read in another 4096
records' worth.
 6b- When I ask for 4096 records' worth and get back less than recLen, I
know I've hit the end of the file.
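
That behaviour is easy to confirm at the interactive prompt - read() past
the end of a file just returns an empty string instead of raising an error
(the filename here is only an example):

>>> f = open('01INS.dat', 'rb')
>>> f.seek(0, 2)            # 2 means "relative to end of file"
>>> f.read(256)
''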

I've only tested on a few machines/client databases so far, but when I added
step 4, processing a 250MB transaction table (256-byte records) went from
nearly 30 seconds down to about 3.5 seconds.  Other results have varied, but
they've all shown improvement.

If anybody sees any glaring inefficiencies, let me know; OTOH if anybody
else needs to do something similar... here's one way to do it.

-- 
www.fsrtechnologies.com