Breaking String into Values
Jeff Shannon
jeff at ccvcorp.com
Fri Mar 29 19:11:04 EST 2002
In article <HD5p8.33836$tg4.398811 at vixen.cso.uiuc.edu>,
rgwright at ux.cso.uiuc.edu says...
> I am working on reading in a data file format which is set up as a series
> of lines that look like this:
>
> 3500035000010104A Foo 45
>
> I want to break this up into variables as follows:
> a = 35000, b = 35000, c = 10104, d = 'A', e = 'Foo', f = 45
>
> My current code (auto-generated from a data dictionary) looks
> something like this:
>
>....
> Is there a better way to do this? The files have around 1000-8000 lines each
> so I would like it to be fast. Is there a package around that someone has
> coded up as a C-extension to do this?
I've done something similar to this, and find it to be not all
that slow. (I haven't timed it, but on my 500-1000 line files
the processing time was no more than about a second.) The most
significant difference in my approach was that I parsed each
line into a class instance, and I used a dictionary to define the
beginning and end of each field. (In my case, there were fields
in the data that I was uninterested in, so this allowed me to
grab only those sections I needed, and by defining my format
separately from my code, it made it easier to adjust.)
So, I would have something like this:
format = { 'a': (0, 5),
           'b': (5, 10),
           'c': (10, 15),
           .... }
class Record:
    def __init__(self, data):
        for field, zone in format.items():
            setattr(self, field, data[zone[0]:zone[1]])
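Filled in and run against the sample line from the original question, that sketch looks like this. (The field names and widths here are guesses inferred from the sample data; the real layout comes from the data dictionary.)

```python
# Minimal runnable sketch of the format-dictionary approach.
# Widths are guessed from the sample line "3500035000010104A Foo 45".
format = {'a': (0, 5),
          'b': (5, 10),
          'c': (10, 16),   # '010104' -- leading zero assumed
          'd': (16, 17),
          'e': (17, 22),
          'f': (22, 24)}

class Record:
    def __init__(self, data):
        # Slice each named field out of the fixed-width line.
        for field, zone in format.items():
            setattr(self, field, data[zone[0]:zone[1]])

r = Record("3500035000010104A Foo 45")
print(r.a, r.c, r.d, r.e.strip())   # -> 35000 010104 A Foo
```

Note that every field comes back as a raw string here; conversion to ints (and stripping padding) is what the converter variant below the validator discussion handles.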
If you want to do error checking of most of the fields, I'd
define a series of functions that do whatever checking and/or
converting you need, and include those in the dictionary:
def ConfirmValidId(text):
    ....

format = { 'a': (ConfirmValidId, 0, 5),
           'b': (int, 5, 10), ... }
Then your setattr line becomes:
    setattr(self, field, zone[0](data[zone[1]:zone[2]]))
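Put together, the converter variant runs like this. (`ConfirmValidId` is a stand-in validator, here just requiring a 5-digit field; the widths are the same guesses as before.)

```python
# Sketch of the converter/validator variant: each dictionary entry
# is (converter_function, start, end).
def ConfirmValidId(text):
    # Hypothetical check -- insist on exactly five digits.
    if not (len(text) == 5 and text.isdigit()):
        raise ValueError("bad id field: %r" % text)
    return int(text)

format = {'a': (ConfirmValidId, 0, 5),
          'b': (int, 5, 10),
          'c': (int, 10, 16),
          'd': (str.strip, 16, 17),
          'e': (str.strip, 17, 22),
          'f': (int, 22, 24)}

class Record:
    def __init__(self, data):
        # Slice the field, then pass it through its converter.
        for field, zone in format.items():
            setattr(self, field, zone[0](data[zone[1]:zone[2]]))

r = Record("3500035000010104A Foo 45")
print(r.a, r.c, r.e, r.f)   # -> 35000 10104 Foo 45
```

A nice property of this layout is that a bad field raises immediately during parsing, pointing at the offending record, rather than surfacing later as a mysterious string-vs-int bug.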
You can also define various access methods and such on your
Record class, making it easier to use. If each record is only a
few hundred bytes, then it's probably not a problem to read your
entire file into memory, too. (It'll take a couple of megs of
memory, which is trivial on a modern machine.)
recordlist = []
for line in datafile.xreadlines():
    recordlist.append( Record(line) )

for rec in recordlist:
    Process(rec)
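In current Pythons, iterating over the file object directly gives the same lazy line-reading as xreadlines(); a self-contained version of the loop above might look like this (Record uses the same guessed layout as earlier, and Process is a stand-in):

```python
import io

# Self-contained version of the read-then-process loop.  A StringIO
# stands in for the real data file; field widths are guesses.
format = {'a': (0, 5), 'f': (22, 24)}

class Record:
    def __init__(self, data):
        for field, zone in format.items():
            setattr(self, field, data[zone[0]:zone[1]])

def Process(rec):
    # Stand-in for whatever per-record work you need.
    return int(rec.f)

datafile = io.StringIO("3500035000010104A Foo 45\n"
                       "3500135000010104B Bar 46\n")

recordlist = [Record(line) for line in datafile]
results = [Process(rec) for rec in recordlist]
print(results)   # -> [45, 46]
```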
...
This should be fast enough unless you've got *very* strict time
constraints.
--
Jeff Shannon
Technician/Programmer
Credit International