Parsing a file based on differing delimiters

Kylotan kylotan at hotmail.com
Tue Oct 21 18:21:13 EDT 2003


I have a text file where the fields are delimited in several different
ways. For example, strings are terminated with a tilde, numbers are
terminated with whitespace, and some identifiers are terminated with a
newline. This means I can't effectively use split() except on a small
scale. For most of the file I can just call one of several functions I
wrote that read in just as much data as is required from the input
string, and return the value and the remaining string. Much of the code
therefore looks like this:

filedata = open('whatever').read()
firstWord, filedata = GetWord(filedata)
nextNumber, filedata = GetNumber(filedata)

This works, but is obviously ugly. Is there a cleaner alternative that
avoids having to re-assign the data all the time (something that will
'consume' the value from the stream)? I'm a bit unclear on the whole
passing by value/reference thing. I'm guessing that while GetWord gets a
reference to the 'filedata' string, assigning to that name inside the
function will just reseat the reference and not change the original
string.
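
To make the guess above concrete, here is a minimal sketch. Python strings
are immutable, so a function that rebinds its parameter cannot change what
the caller sees. One possible way around the re-assignment is an object that
keeps the text together with a read position. The names get_word_broken and
Cursor, the sample data, and the assumption that a single space follows each
word and number are all invented for illustration.

```python
def get_word_broken(data):
    """Rebinding the parameter only changes the local name 'data'."""
    word, _, data = data.partition(' ')
    return word


class Cursor:
    """Hypothetical helper: the text plus the position read so far."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def read_until(self, terminator):
        """Return the text before the terminator and step past it."""
        end = self.text.find(terminator, self.pos)
        if end == -1:
            raise ValueError('missing terminator %r' % terminator)
        value = self.text[self.pos:end]
        self.pos = end + len(terminator)
        return value

    def rest(self):
        """Return whatever has not been consumed yet."""
        return self.text[self.pos:]


filedata = 'sword 42 Rusty blade~'   # made-up sample input
get_word_broken(filedata)
print(filedata)                      # the caller's string is unchanged

cur = Cursor(filedata)
first_word = cur.read_until(' ')         # assumes one space after the word
next_number = int(cur.read_until(' '))   # assumes one space after the number
description = cur.read_until('~')        # strings end at a tilde
print(first_word, next_number, description)
```

Because the position lives in the Cursor object, each call consumes its
field and the calling code never has to re-assign the data.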

The other problem is that parts of the format are potentially repeated
an arbitrary number of times and therefore a degree of lookahead is
required. If I've already extracted a token and then find out I don't
need it yet, putting it back is awkward. Yet there is nowhere near enough
complexity or repetition in the file format to justify a formal
grammar or anything like that.
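
As a sketch of how one token of lookahead could work without a formal
grammar, a small wrapper around any token iterator can support peeking and
pushing a token back. The class name Pushback, the read_items function, and
the 'item'/'end' token layout are hypothetical and not taken from the real
file format.

```python
class Pushback:
    """Hypothetical iterator wrapper that allows tokens to be put back."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._stack = []

    def __iter__(self):
        return self

    def __next__(self):
        # Serve pushed-back tokens first, most recently pushed first.
        if self._stack:
            return self._stack.pop()
        return next(self._it)

    def push(self, item):
        """Put a token back so the next read returns it again."""
        self._stack.append(item)

    def peek(self, default=None):
        """Look at the next token without consuming it."""
        try:
            item = next(self)
        except StopIteration:
            return default
        self.push(item)
        return item


def read_items(tokens):
    """Read a repeated 'item <name>' section of unknown length."""
    items = []
    while tokens.peek() == 'item':
        next(tokens)                 # consume the 'item' keyword
        items.append(next(tokens))   # consume the name that follows
    return items


tokens = Pushback(['item', 'sword', 'item', 'shield', 'end'])
items = read_items(tokens)
following = next(tokens)
print(items, following)
```

The loop stops as soon as the peeked token is not part of the repeated
section, and that token is still available to whatever code runs next.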

All in all, in the basic parsing code I am doing a lot more operations
on the input data than I would like. I can see how I'd encapsulate
this behind functions if I was willing to iterate through the data
character by character like I would in C++. But I am hoping that
Python can, as usual, save me from the majority of this drudgery
somehow.
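
One way to avoid the character-by-character loop, sketched under
assumptions, is to describe each kind of field with a regular expression and
match it at the current offset with the compiled pattern's match(text, pos)
method. The field rules follow the description at the top of this post; the
exact patterns, the names PATTERNS and parse_fields, and the sample record
are guesses made for illustration.

```python
import re

# Assumed rules: strings end at a tilde, numbers at whitespace (or end of
# input), identifiers at a newline (or end of input).
PATTERNS = {
    'string': re.compile(r'\s*([^~]*)~'),
    'number': re.compile(r'\s*(-?\d+)(?=\s|$)'),
    'identifier': re.compile(r'[ \t]*([^\n]*)(?:\n|$)'),
}


def parse_fields(text, kinds, pos=0):
    """Read one field per entry in kinds; return (values, new_pos)."""
    values = []
    for kind in kinds:
        m = PATTERNS[kind].match(text, pos)
        if m is None:
            raise ValueError('expected %s at offset %d' % (kind, pos))
        raw = m.group(1)
        values.append(int(raw) if kind == 'number' else raw)
        pos = m.end()   # continue right after the matched field
    return values, pos


text = 'Rusty sword~ 42 weapon\nnext record'   # made-up sample input
values, pos = parse_fields(text, ['string', 'number', 'identifier'])
print(values, repr(text[pos:]))
```

Listing the field kinds in order keeps the per-field bookkeeping in one
place, and the regular expression engine does the scanning.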

Any help appreciated.

-- 
Ben Sizer



