[Tutor] parsing a "chunked" text file

Karim Liateni karim.liateni at free.fr
Thu Mar 4 01:23:06 CET 2010


Hello Steven,

Is there a big difference if I write your first function as below? I ask
because I am not familiar with the yield keyword.

def skip_blanks(lines):
    """Remove leading and trailing whitespace, ignore blank lines."""
    return [line.strip() for line in lines if line.strip()]


I tried to rewrite the second function as well, but it is not as
straightforward. I am beginning to understand the use of yield in it.
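For comparison, here is a minimal sketch of both variants side by side (the
function names are just for illustration). They produce the same lines; the
difference is that the list version builds everything up front, while the
generator version produces one line at a time, on demand:

```python
def skip_blanks_list(lines):
    """List version: strips every line up front and returns them all at once."""
    return [line.strip() for line in lines if line.strip()]

def skip_blanks_gen(lines):
    """Generator version: yields stripped lines one at a time, lazily."""
    for line in lines:
        line = line.strip()
        if line:
            yield line

raw = ["  a \n", "\n", " b\n"]
print(skip_blanks_list(raw))       # ['a', 'b']
print(list(skip_blanks_gen(raw)))  # ['a', 'b']
```

For a small list the difference is invisible; for a very large file the
generator never holds more than one line in memory.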

Regards
Karim

Steven D'Aprano wrote:
> On Tue, 2 Mar 2010 05:22:43 pm Andrew Fithian wrote:
>   
>> Hi tutor,
>>
>> I have a large text file that has chunks of data like this:
>>
>> headerA n1
>> line 1
>> line 2
>> ...
>> line n1
>> headerB n2
>> line 1
>> line 2
>> ...
>> line n2
>>
>> Where each chunk is a header and the lines that follow it (up to the
>> next header). A header has the number of lines in the chunk as its
>> second field.
>>     
>
> And what happens if the header is wrong? How do you handle situations 
> like missing headers and empty sections, header lines which are wrong, 
> and duplicate headers?
>
> line 1
> line 2
> headerB 0
> headerC 1
> line 1
> headerD 2
> line 1
> line 2
> line 3
> line 4
> headerE 23
> line 1
> line 2
> headerB 1
> line 1
>
>
>
> This is a policy decision: do you try to recover, raise an exception, 
> raise a warning, pad missing lines as blank, throw away excess lines, 
> or what?
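> One way such a policy could look, sketched against the (header, lines)
> pairs that the filters below produce (check_counts is an illustrative
> name, not anything standard; here the policy is simply to report
> mismatches rather than repair them):

```python
def check_counts(sections):
    """Return the headers whose declared line count disagrees with the
    number of lines actually collected (one possible policy check).

    `sections` is an iterable of (header, lines) pairs, where a header
    looks like "headerA 2" -- a name followed by the declared count."""
    bad = []
    for header, lines in sections:
        parts = header.split()
        # A header without a parseable count is also reported
        if len(parts) < 2 or not parts[1].isdigit():
            bad.append(header)
        elif int(parts[1]) != len(lines):
            bad.append(header)
    return bad

print(check_counts([("headerA 2", ["line 1", "line 2"]),
                    ("headerB 3", ["line 1"])]))
# ['headerB 3']
```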
>
>
>   
>> I would like to turn this file into a dictionary like:
>> dict = {'headerA':[line 1, line 2, ... , line n1], 'headerB':[line1,
>> line 2, ... , line n2]}
>>
>> Is there a way to do this with a dictionary comprehension or do I
>> have to iterate over the file with a "while 1" loop?
>>     
>
> I wouldn't do either. I would treat this as a pipe-line problem: you 
> have a series of lines that need to be processed. You can feed them 
> through a pipe-line of filters:
>
> def skip_blanks(lines):
>     """Remove leading and trailing whitespace, ignore blank lines."""
>     for line in lines:
>         line = line.strip()
>         if line:
>             yield line
>
> def collate_sections(lines):
>     """Yield (header, list-of-lines) pairs, one per section."""
>     current_header = ""
>     accumulator = []
>     for line in lines:
>         if line.startswith("header"):
>             # Skip the spurious empty pair before the first header
>             if current_header or accumulator:
>                 yield (current_header, accumulator)
>             current_header = line
>             accumulator = []
>         else:
>             accumulator.append(line)
>     yield (current_header, accumulator)
>
>
> Then put them together like this:
>
>
> with open("my_file.dat", "r") as fp:
>     data = {}  # don't shadow the built-in dict
>     non_blank_lines = skip_blanks(fp)
>     sections = collate_sections(non_blank_lines)
>     for (header, lines) in sections:
>         data[header] = lines
>
>
> Of course you can add your own error checking.
>
>
>   
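
Putting the two filters together on a small in-memory sample, I get the
following (a self-contained sketch: io.StringIO stands in for the real
file, and a small guard skips the empty header before the first section):

```python
import io

def skip_blanks(lines):
    """Strip whitespace and drop blank lines (the first filter)."""
    for line in lines:
        line = line.strip()
        if line:
            yield line

def collate_sections(lines):
    """Group lines into (header, lines-of-section) pairs (second filter)."""
    current_header = ""
    accumulator = []
    for line in lines:
        if line.startswith("header"):
            # Skip the spurious empty pair before the first header
            if current_header or accumulator:
                yield (current_header, accumulator)
            current_header = line
            accumulator = []
        else:
            accumulator.append(line)
    yield (current_header, accumulator)

sample = io.StringIO("headerA 2\nline 1\nline 2\n\nheaderB 1\nline 1\n")
data = dict(collate_sections(skip_blanks(sample)))
print(data)
# {'headerA 2': ['line 1', 'line 2'], 'headerB 1': ['line 1']}
```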
