[Tutor] Parsing a multi-line/record text file

Sun Nov 11 07:37:36 CET 2012

On 11/11/2012 12:01 AM, Marc wrote:
> Hello,
>
> I am trying to parse a text file with a structure that looks like:
>
> [record: Some text about the record]

So the record delimiter starts with a left bracket, in the first column?
And all lines within the record are indented?  Use this fact.

> 	Attribute 1 = Attribute 1 text
> 	Attribute 3 = Attribute 3 text
> 	Attribute 4 = Attribute 4 text
> 	Attribute 7 = Attribute 7 text
>
> [record: Some text about the record]
> 	Attribute 1 = Attribute 1 text
> 	Attribute 2 = Attribute 2 text
> 	Attribute 3 = Attribute 3 text
> 	Attribute 4 = Attribute 4 text
> 	Attribute 5 = Attribute 5 text
> 	Attribute 6 = Attribute 6 text
>
> [record: Some text about the record]
> 	Attribute 2 = Attribute 2 text
> 	Attribute 3 = Attribute 3 text
> 	Attribute 7 = Attribute 7 text
> 	Attribute 8 = Attribute 8 text
>
> Etc. for many hundreds of records
>
> I am looking to create output that looks like:
>
> Attribute 1 text | Attribute 3 text
> Attribute 1 text | Attribute 3 text
> Blank                      | Attribute 3 text
>
> Treating each record as a record with its associated lines is the holy grail
> for which I am searching, yet I seem to only be coming up with dead parrots.
> It should be simple, but the answer is eluding me and Google has not been
> helpful.
>
> Pathetic thing is that I do this with Python and XML all the time, but I
> can't seem to figure out a simple text file.  I'm missing something simple,
> I'm sure.  Here's the most I have gotten to work (poorly) so far - it gets
> me the correct data, but not in the correct format because the file is being
> handled sequentially, not by record - it's not even close, but I thought I'd
> include it here:
>
>      for line in infile:
>           while line != '\n':
>                Attribute1 = 'Blank'
>                Attribute3 = 'Blank'
>                line = line.lstrip('\t')
>                line = line.rstrip('\n')
>                LineElements = line.split('=')
>                if LineElements[0] == 'Attribute1 ':
>                     Attribute1=LineElements[1]
>                if LineElements[0] == 'Attribute3 ':
>                     Attribute3=LineElements[1]
>                print("%s | %s\n" % (Attribute1, Attribute3))
>
> Is there a library or example I could be looking at for this?  I use lxml
> for xml, but I don't think it will work for this - at least the way I tried
> did not.

I don't think any existing library will fit your format, unless you
happen to be very lucky.

What you probably want is to write a generator function that gives you a
record at a time.  It'll take a file object (infile) and it'll yield a
list of lines.  Then your main loop would be something like:

      for record in records(infile):
          attrib1 = attrib2 = ""
          for line in record:
              line = line.strip()
              line_elements = line.split("=")
              # etc.
          # here you print out attrib1/attrib2 as appropriate

I'll leave you to write the records() generator.  But the next() method
will probably play a part.
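
If you get stuck, here is one possible shape for it, as a rough sketch rather than the only way to do it.  It assumes Python 3, that every record header starts with "[" in the first column, and that attribute lines look like "name = text".  It uses a plain for loop instead of calling next() directly, and the attributes() helper and the sample data are made up for illustration:

```python
import io


def records(infile):
    """Yield each record as a list of its lines, header line first."""
    record = []
    for line in infile:
        if line.startswith("["):
            # A new header closes the previous record, if there was one.
            if record:
                yield record
            record = [line.rstrip("\n")]
        elif line.strip() and record:
            # Non-blank line inside a record: keep it, minus the indentation.
            record.append(line.strip())
    # The last record has no following header to flush it.
    if record:
        yield record


def attributes(record):
    """Hypothetical helper: map attribute names to their text."""
    attrs = {}
    for line in record[1:]:
        key, sep, value = line.partition("=")
        if sep:
            attrs[key.strip()] = value.strip()
    return attrs


# Made-up sample standing in for the real file.
sample = io.StringIO(
    "[record: first]\n"
    "\tAttribute 1 = Attribute 1 text\n"
    "\tAttribute 3 = Attribute 3 text\n"
    "\n"
    "[record: second]\n"
    "\tAttribute 2 = Attribute 2 text\n"
    "\tAttribute 3 = Attribute 3 text\n"
)

rows = []
for record in records(sample):
    attrs = attributes(record)
    rows.append("%s | %s" % (attrs.get("Attribute 1", "Blank"),
                             attrs.get("Attribute 3", "Blank")))
print("\n".join(rows))
```

Collecting the attributes into a dict per record means a missing attribute simply falls back to "Blank" through dict.get(), so nothing carries over from one record to the next.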

-- 

DaveA