Breaking String into Values

Jeff Shannon jeff at ccvcorp.com
Fri Mar 29 19:11:04 EST 2002


In article <HD5p8.33836$tg4.398811 at vixen.cso.uiuc.edu>, 
rgwright at ux.cso.uiuc.edu says...
> I am working on reading in a data file format which is set up as a series
> of lines that look like this:
> 
> 3500035000010104A Foo 45
> 
> I want to break up into a variables as follows:
> a = 35000, b = 35000, c = 10104, d = 'A', e = 'Foo', f = 45
> 
> My current code (auto-generated from a data dictionary) looks
> something like this:
> 
>.... 
> Is there a better way to do this? The files have around 1000-8000 lines each
> so I would like it to be fast. Is there a package around that someone has
> coded up as a C-extension to do this?

I've done something similar to this, and find it to be not all 
that slow.  (I haven't timed it, but on my 500-1000 line files 
the processing time was no more than about a second.)  The most 
significant difference in my approach was that I parsed each 
line into a class object, and I used a dictionary to define the 
beginning and end of each field.  (In my case, there were fields 
in the data that I was uninterested in, so this allowed me to 
grab only those sections I needed, and by defining my format 
separately from my code, it made it easier to adjust.)

So, I would have something like this:

format = { 'a': (0,5),
           'b': (5,10),
           'c': (10,15),
           ....        }

class Record:
    def __init__(self, data):
        for field, zone in format.items():
            setattr(self, field, data[zone[0]:zone[1]])

If you want to do error checking of most of the fields, I'd 
define a series of functions that do whatever checking and/or 
converting you need, and include those in the dictionary:

def ConfirmValidId(text):
    ....

format = { 'a': (ConfirmValidId, 0, 5),
           'b': (int, 5, 10),   ... }

Then your setattr line becomes:

        setattr(self, field, zone[0]( data[zone[1]:zone[2]] ) )
 
You can also define various access methods and such on your 
Record class, making it easier to use.  If each record is only a 
few hundred bytes, then it's probably not a problem to read your 
entire file into memory, too.  (It'll take a couple of megs of 
memory, which is trivial on a modern machine.)

recordlist = []
for line in datafile.xreadlines():
    recordlist.append( Record(line) )

for rec in recordlist:
    Process(rec) 
    ...

This should be fast enough unless you've got *very* strict time 
constraints.

-- 

Jeff Shannon
Technician/Programmer
Credit International


