ASCII delimited files
Thomas A. Bryan
tbryan at python.net
Wed Nov 10 21:03:55 EST 1999
Roger Irwin wrote:
>
> Is there any function or module available for parsing ASCII delimited files,
> before I go and re-invent the wheel writing my own.
I'm not sure exactly what you're looking for. I've appended something
that I was playing with one day. It was just an easy way to create an
object that could parse and validate delimited ASCII files.
It might be terribly slow: I never timed it.
Basically, you create a DelimFldParser object with a list of
instances of DelimParserField subclasses and a delimiter. Each
DelimParserField subclass knows how to handle a specific "column"
of the ASCII file. The DelimFldParser is then handed a file
object (anything with a readline() method, really), and it
returns a list of lists. Each inner list holds the values
returned by the DelimParserField objects for one line.
Oh, I also assume that each line of the file has the same number
of "columns."
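As a rough illustration of that flow (a simplified stand-in, not the
appended module itself), the sketch below is written for Python 3 and uses
plain callables in place of the field objects. The names parse_lines and
converters are hypothetical; the point is only the shape of the result:
one inner list per line, one converted value per column.

```python
import io

def parse_lines(file_obj, converters, delimiter=None):
    """Return a list of lists: one inner list per line, one value per column.

    converters is an ordered list of one-argument callables, one per column.
    (They are hypothetical stand-ins for the field objects described above.)
    """
    rows = []
    line = file_obj.readline()
    while line:
        # Strip the line ending, then split; delimiter=None means whitespace.
        values = line.rstrip("\r\n").split(delimiter)
        # Every line is assumed to have the same number of columns.
        if len(values) != len(converters):
            raise ValueError("wrong number of fields: %r" % line)
        rows.append([conv(val) for conv, val in zip(converters, values)])
        line = file_obj.readline()
    return rows

# Anything with a readline() method will do; StringIO stands in for a file.
sample = io.StringIO("a 10\nb 3.5\n")
rows = parse_lines(sample, [str, float])
print(rows)
```

The loop reads with readline() rather than iterating over the file, so any
object that offers that one method can be used as the input.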
I implemented three sample DelimParserField subclasses. One converts
ASCII values to floats and checks that they fall within a range.
Another checks that the field value is in a specified list of values.
The last verifies field values against a regular expression.
I wrote this thing to read and verify files before importing them
into a database. I never really had much chance to use it, though.
I would love to see someone optimize this thing because it makes the
task of building a parser for a new format of an ASCII file very
simple. It would be great, for example, for dealing with delimited
data exported from a database or for parsing a delimited file
for import into a database.
---Tom
#!/usr/bin/python
import re

class DelimFldParser:
    def __init__(self, fields, delimiter=None):
        """fields is an ordered list of DelimParserField instances"""
        self.delimiter = delimiter
        self.fields = fields
        self.numCols = len(fields)
        self.cols = []
        for el in fields:
            self.cols.append(el.name)

    def parseLine(self, line):
        # Drop the line ending first so that it cannot cling to the last
        # field when an explicit delimiter is given.
        values = line.rstrip('\r\n').split(self.delimiter)
        assert len(values) == self.numCols, \
            "The following line has the wrong number of fields.\n%s" % line
        for idx in range(self.numCols):
            # Convert the raw text first, then validate the converted value.
            values[idx] = self.fields[idx].convert(values[idx])
            self.fields[idx].verify(values[idx])
        return values

    def parseFile(self, fileObj):
        """Parse every line of fileObj and return a list of lists."""
        data = []
        line = fileObj.readline()
        while line:
            data.append(self.parseLine(line))
            line = fileObj.readline()
        return data

    def __str__(self):
        s = '<DelimFldParser: '
        for el in self.fields:
            s = s + el.name + ', '
        s = s[:-2] + ' >'
        return s
class DelimParserField:
    def __init__(self, name):
        self.name = name

    def convert(self,value):
        return value

    def verify(self,value):
        pass
class EnumField(DelimParserField):
    def __init__(self,name,validValues):
        DelimParserField.__init__(self,name)
        self.validValues = validValues[:]

    def verify(self,value):
        assert value in self.validValues, \
            "%s not in %s on the following line" % (value,self.validValues)
class NumericRngField(DelimParserField):
    def __init__(self,name,start,stop):
        DelimParserField.__init__(self,name)
        self.min = start
        self.max = stop

    def convert(self,value):
        return float(value)

    def verify(self,value):
        assert value >= self.min and value <= self.max, \
            "%s is not between %s and %s" % (value,self.min,self.max)
class RegexpField(DelimParserField):
    def __init__(self,name,regexp,flags=None):
        DelimParserField.__init__(self,name)
        if flags:
            self.re = re.compile(regexp,flags)
        else:
            self.re = re.compile(regexp)

    def verify(self,value):
        assert self.re.search(value), \
            "%s does not match the pattern '%s'" % (value, self.re.pattern)
if __name__ == '__main__':
    # Write a small whitespace-delimited sample file to parse.
    fh = open('delimParser.test','w')
    fh.write("""a 10 9/10/1999
b 3.5 10/11/1974
c 5.7 09/10/1974
""")
    fh.close()
    fh = open('delimParser.test','r')
    myParser = DelimFldParser((EnumField('Enum',('a','b','c')),
                               NumericRngField('Range',0,10),
                               RegexpField('RegExp',r'\d{1,2}/\d{2}/\d{4}')))
    print(myParser)
    output = myParser.parseFile(fh)
    fh.close()
    print(output)
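
Supporting a new kind of column comes down to writing one more subclass
with its own convert() and verify(). As a sketch only, written for Python 3,
the hypothetical DateField below turns month/day/year text into a date and
optionally rejects dates before a cutoff. DateField and its earliest
parameter are assumptions made for illustration; the base class is restated
so that the example runs on its own.

```python
import datetime

class DelimParserField:
    """Base class restated from the post: convert() first, then verify()."""
    def __init__(self, name):
        self.name = name

    def convert(self, value):
        return value

    def verify(self, value):
        pass

class DateField(DelimParserField):
    """Hypothetical field: parses month/day/year text into a date."""
    def __init__(self, name, earliest=None):
        DelimParserField.__init__(self, name)
        self.earliest = earliest

    def convert(self, value):
        # strptime raises ValueError if the text is not a valid date.
        return datetime.datetime.strptime(value, "%m/%d/%Y").date()

    def verify(self, value):
        # Follows the post's convention of signalling bad data with assert.
        if self.earliest is not None:
            assert value >= self.earliest, \
                "%s is earlier than %s" % (value, self.earliest)

field = DateField("Date", earliest=datetime.date(1970, 1, 1))
value = field.convert("9/10/1999")
field.verify(value)
print(value)
```

Because verify() receives the already converted value, the range check can
compare date objects directly instead of comparing strings.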