[Tutor] parsing a "chunked" text file
Karim Liateni
karim.liateni at free.fr
Thu Mar 4 01:23:06 CET 2010
Hello Steven,
Is there a big difference if I write your first function as below? I ask
because I am not familiar with the yield keyword.
def skip_blanks(lines):
    """Remove leading and trailing whitespace, ignore blank lines."""
    return [line.strip() for line in lines if line.strip()]
I tried to write the second function the same way, but it is not as
straightforward.
I am beginning to understand the use of yield in it.
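
For comparison, here is a minimal sketch of the two styles side by side.
The names skip_blanks_list and skip_blanks_gen and the sample data are
made up for this illustration. As far as I understand it, the list
version builds the whole result in memory before returning, while the
generator version hands back one line at a time, which may matter for a
large file.

```python
def skip_blanks_list(lines):
    """Eager version: return a list of all stripped, non-blank lines."""
    return [line.strip() for line in lines if line.strip()]


def skip_blanks_gen(lines):
    """Lazy version: yield stripped, non-blank lines one at a time."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


# Hypothetical sample input standing in for lines read from a file.
sample = ["headerA 2\n", "\n", "  line 1\n", "line 2  \n"]

as_list = skip_blanks_list(sample)   # all the work happens here
as_gen = skip_blanks_gen(sample)     # nothing is processed yet

# Both produce the same lines; the generator only does so on demand.
assert as_list == list(as_gen)
```

Both can be passed to a for loop in the same way, so the rest of the
pipeline should not need to change.
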
Regards
Karim
Steven D'Aprano wrote:
> On Tue, 2 Mar 2010 05:22:43 pm Andrew Fithian wrote:
>
>> Hi tutor,
>>
>> I have a large text file that has chunks of data like this:
>>
>> headerA n1
>> line 1
>> line 2
>> ...
>> line n1
>> headerB n2
>> line 1
>> line 2
>> ...
>> line n2
>>
>> Where each chunk is a header and the lines that follow it (up to the
>> next header). A header has the number of lines in the chunk as its
>> second field.
>>
>
> And what happens if the header is wrong? How do you handle situations
> like missing headers and empty sections, header lines which are wrong,
> and duplicate headers?
>
> line 1
> line 2
> headerB 0
> headerC 1
> line 1
> headerD 2
> line 1
> line 2
> line 3
> line 4
> headerE 23
> line 1
> line 2
> headerB 1
> line 1
>
>
>
> This is a policy decision: do you try to recover, raise an exception,
> raise a warning, pad missing lines as blank, throw away excess lines,
> or what?
>
>
>
>> I would like to turn this file into a dictionary like:
>> dict = {'headerA':[line 1, line 2, ... , line n1], 'headerB':[line1,
>> line 2, ... , line n2]}
>>
>> Is there a way to do this with a dictionary comprehension or do I
>> have to iterate over the file with a "while 1" loop?
>>
>
> I wouldn't do either. I would treat this as a pipe-line problem: you
> have a series of lines that need to be processed. You can feed them
> through a pipe-line of filters:
>
> def skip_blanks(lines):
>     """Remove leading and trailing whitespace, ignore blank lines."""
>     for line in lines:
>         line = line.strip()
>         if line:
>             yield line
>
> def collate_sections(lines):
>     """Yield (header, lines) pairs, one for each section."""
>     current_header = ""
>     accumulator = []
>     for line in lines:
>         if line.startswith("header"):
>             yield (current_header, accumulator)
>             current_header = line
>             accumulator = []
>         else:
>             accumulator.append(line)
>     yield (current_header, accumulator)
>
>
> Then put them together like this:
>
>
> fp = open("my_file.dat", "r")
> data = {}  # don't shadow the built-in dict
> non_blank_lines = skip_blanks(fp)
> sections = collate_sections(non_blank_lines)
> for (header, lines) in sections:
>     data[header] = lines
>
>
> Of course you can add your own error checking.
>
>
>
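
As one possible way to add the error checking mentioned above, here is a
rough sketch that takes the "raise an exception" policy. That policy is
only an assumption; recovering or warning would be equally valid
choices. The helper names check_section and build_dict are hypothetical,
and the code assumes each header looks like "headerA 3" (a name followed
by the line count).

```python
def check_section(header, lines):
    """Return the header name, or raise ValueError if the section is bad.

    Assumes a header of the form 'name count', e.g. 'headerA 3'.
    """
    fields = header.split()
    if len(fields) != 2 or not fields[1].isdigit():
        raise ValueError("malformed header: %r" % header)
    expected = int(fields[1])
    if expected != len(lines):
        raise ValueError("%s: expected %d lines, found %d"
                         % (fields[0], expected, len(lines)))
    return fields[0]


def build_dict(sections):
    """Collect (header, lines) pairs into a dict, rejecting duplicates."""
    data = {}
    for header, lines in sections:
        if not header and not lines:
            # Empty pair emitted before the first header; nothing to store.
            continue
        name = check_section(header, lines)
        if name in data:
            raise ValueError("duplicate header: %s" % name)
        data[name] = lines
    return data
```

The sections argument is any iterable of (header, lines) pairs, so the
output of the section-collating generator could be passed to build_dict
directly in place of the plain for loop.
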