[Tutor] parsing a "chunked" text file
Steven D'Aprano
steve at pearwood.info
Tue Mar 2 16:29:20 CET 2010
On Tue, 2 Mar 2010 05:22:43 pm Andrew Fithian wrote:
> Hi tutor,
>
> I have a large text file that has chunks of data like this:
>
> headerA n1
> line 1
> line 2
> ...
> line n1
> headerB n2
> line 1
> line 2
> ...
> line n2
>
> Where each chunk is a header and the lines that follow it (up to the
> next header). A header has the number of lines in the chunk as its
> second field.
And what happens if the header is wrong? How do you handle missing
headers, empty sections, headers whose line count is wrong, and
duplicate headers? For example:

    line 1
    line 2
    headerB 0
    headerC 1
    line 1
    headerD 2
    line 1
    line 2
    line 3
    line 4
    headerE 23
    line 1
    line 2
    headerB 1
    line 1

This is a policy decision: do you try to recover, raise an exception,
raise a warning, pad missing lines as blank, throw away excess lines,
or what?
> I would like to turn this file into a dictionary like:
> dict = {'headerA':[line 1, line 2, ... , line n1], 'headerB':[line1,
> line 2, ... , line n2]}
>
> Is there a way to do this with a dictionary comprehension or do I
> have to iterate over the file with a "while 1" loop?
I wouldn't do either. I would treat this as a pipeline problem: you
have a series of lines that need to be processed, and you can feed
them through a pipeline of filters, each one a generator:

def skip_blanks(lines):
    """Remove leading and trailing whitespace, ignore blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield line
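
def collate_sections(lines):
    """Yield (header, section_lines) pairs, one per section."""
    current_header = ""
    accumulator = []
    for line in lines:
        if line.startswith("header"):
            # A new header closes the previous section. Skip the empty
            # placeholder that would otherwise precede the first header.
            if current_header or accumulator:
                yield (current_header, accumulator)
            current_header = line
            accumulator = []
        else:
            accumulator.append(line)
    # Emit the final section, unless the input was empty.
    if current_header or accumulator:
        yield (current_header, accumulator)
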
Then put them together like this:

data = {}  # don't shadow the built-in dict
with open("my_file.dat", "r") as fp:
    non_blank_lines = skip_blanks(fp)
    sections = collate_sections(non_blank_lines)
    for (header, lines) in sections:
        # The key is the full header line, count included.
        data[header] = lines

Of course you can add your own error checking.
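
As a rough idea of what that checking might look like, here is one more
filter for the end of the pipeline. It is only a sketch under one possible
policy (raise an exception on anything suspicious); the names parse_header
and checked_sections are made up for this example, and it assumes every
header is exactly a name followed by an integer count. It accepts the
(header, lines) pairs produced by the section-collating generator above.

```python
def parse_header(header):
    """Split a header line such as 'headerA 3' into ('headerA', 3).

    Hypothetical helper: raises ValueError if the line does not have
    exactly two fields or the second field is not an integer.
    """
    name, count = header.split()
    return name, int(count)


def checked_sections(sections):
    """Yield (name, lines) pairs, raising ValueError on bad sections."""
    seen = set()
    for header, lines in sections:
        if not header:
            if lines:
                raise ValueError("%d line(s) before the first header"
                                 % len(lines))
            continue  # nothing came before the first header
        name, expected = parse_header(header)
        if name in seen:
            raise ValueError("duplicate header: %r" % name)
        if len(lines) != expected:
            raise ValueError("%s: expected %d line(s), got %d"
                             % (name, expected, len(lines)))
        seen.add(name)
        yield name, lines
```

With this in place, the loop that fills the dictionary could become
data = dict(checked_sections(sections)), and the keys would be the bare
header names without the count. A more forgiving policy (warn and carry
on, pad or truncate the lines) would slot into the same place.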
--
Steven D'Aprano