[Tutor] Parsing a multi-line/record text file
Dave Angel
d at davea.name
Sun Nov 11 07:37:36 CET 2012
On 11/11/2012 12:01 AM, Marc wrote:
> Hello,
>
> I am trying to parse a text file with a structure that looks like:
>
> [record: Some text about the record]
So the record delimiter starts with a left bracket, in first column?
And all lines within the record are indented? Use this fact.
> Attribute 1 = Attribute 1 text
> Attribute 3 = Attribute 3 text
> Attribute 4 = Attribute 4 text
> Attribute 7 = Attribute 7 text
>
> [record: Some text about the record]
> Attribute 1 = Attribute 1 text
> Attribute 2 = Attribute 2 text
> Attribute 3 = Attribute 3 text
> Attribute 4 = Attribute 4 text
> Attribute 5 = Attribute 5 text
> Attribute 6 = Attribute 6 text
>
> [record: Some text about the record]
> Attribute 2 = Attribute 2 text
> Attribute 3 = Attribute 3 text
> Attribute 7 = Attribute 7 text
> Attribute 8 = Attribute 8 text
>
> Etc. for many hundreds of records
>
> I am looking to create output that looks like:
>
> Attribute 1 text | Attribute 3 text
> Attribute 1 text | Attribute 3 text
> Blank | Attribute 3 text
>
> Treating each record as a record with its associated lines is the holy grail
> for which I am searching, yet I seem to only be coming up with dead parrots.
> It should be simple, but the answer is eluding me and Google has not been
> helpful.
>
> Pathetic thing is that I do this with Python and XML all the time, but I
> can't seem to figure out a simple text file. I'm missing something simple,
> I'm sure. Here's the most I have gotten to work (poorly) so far - it gets
> me the correct data, but not in the correct format because the file is being
> handled sequentially, not by record - it's not even close, but I thought I'd
> include it here:
>
> for line in infile:
> while line != '\n':
> Attribute1 = 'Blank'
> Attribute3 = 'Blank'
> line = line.lstrip('\t')
> line = line.rstrip('\n')
> LineElements = line.split('=')
> if LineElements[0] == 'Attribute1 ':
> Attribute1=LineElements[1]
> if LineElements[0] == 'Attribute3 ':
> Attribute3=LineElements[1]
> print("%s | %s\n" % (Attribute1, Attribute3))
>
> Is there a library or example I could be looking at for this? I use lxml
> for xml, but I don't think it will work for this - at least the way I tried
> did not.
I don't think any existing library will fit your format, unless you
happen to be very lucky.
What you probably want is to write a generator function that gives you a
record at a time. It'll take a file object (infile) and it'll yield a
list of lines. Then your main loop would be something like:
    for record in records(infile):
        attrib1 = attrib2 = ""
        for line in record:
            line = line.strip()
            line_elements = line.split("=")
            # etc.: pick out the attributes you want from line_elements
        # here you print out attrib1/attrib2 as appropriate
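To make the inner loop concrete, here is a minimal sketch of handling one record once you have it as a list of lines. It assumes the attribute names look like "Attribute 1" as in the sample file, and that a missing attribute should print as "Blank". The helper names parse_attributes and format_row are made up for illustration; they are not from the original code.

```python
def parse_attributes(record):
    """Turn one record's lines into a dict of attribute name -> text."""
    attribs = {}
    for line in record:
        line = line.strip()
        # Skip blank lines and the "[record: ...]" header line.
        if not line or line.startswith("["):
            continue
        # Split on the first "=" only, so the text may itself contain "=".
        name, sep, value = line.partition("=")
        if sep:
            attribs[name.strip()] = value.strip()
    return attribs


def format_row(attribs, wanted=("Attribute 1", "Attribute 3")):
    """Join the wanted attributes with " | ", using "Blank" for missing ones."""
    return " | ".join(attribs.get(name, "Blank") for name in wanted)


# One record from the sample file, as a list of lines (hypothetical input).
record = [
    "[record: Some text about the record]",
    "\tAttribute 2 = Attribute 2 text",
    "\tAttribute 3 = Attribute 3 text",
]
print(format_row(parse_attributes(record)))  # prints: Blank | Attribute 3 text
```

Collecting everything into a dict first means the defaults are handled per record rather than per line, which is where the sequential version goes wrong.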
I'll leave you to write the records() generator. But the next() method
will probably play a part.
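If you want something to compare your attempt against, one possible shape for records() is sketched below. It assumes a record starts on any line with "[" in the first column and that blank lines only separate records. Note that this version uses a plain for loop and a running list rather than calling next() directly; both approaches can work.

```python
import io


def records(infile):
    """Yield each record as a list of lines (header line included)."""
    current = []
    for line in infile:
        if line.startswith("["):
            # A new header: hand back the previous record, if any.
            if current:
                yield current
            current = [line]
        elif line.strip() and current:
            # A non-blank line belonging to the record in progress.
            # Lines that appear before the first header are ignored.
            current.append(line)
    # Don't forget the last record in the file.
    if current:
        yield current


# io.StringIO stands in for an open file here; the data is made up.
sample = io.StringIO(
    "[record: first]\n"
    "\tAttribute 1 = one\n"
    "\n"
    "[record: second]\n"
    "\tAttribute 3 = three\n"
)
for record in records(sample):
    print(len(record), record[0].strip())
```

Because it is a generator, only one record is held in memory at a time, which matters little for a few hundred records but costs nothing either.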
--
DaveA