[Tutor] parsing a "chunked" text file

Tue Mar 2 16:29:20 CET 2010

On Tue, 2 Mar 2010 05:22:43 pm Andrew Fithian wrote:
> Hi tutor,
>
> I have a large text file that has chunks of data like this:
>
> headerA n1
> line 1
> line 2
> ...
> line n1
> headerB n2
> line 1
> line 2
> ...
> line n2
>
> Where each chunk is a header and the lines that follow it (up to the
> next header). A header has the number of lines in the chunk as its
> second field.

And what happens if the header is wrong? How do you handle situations 
like missing headers and empty sections, header lines which are wrong, 
and duplicate headers?

line 1
line 2
headerB 0
headerC 1
line 1
headerD 2
line 1
line 2
line 3
line 4
headerE 23
line 1
line 2
headerB 1
line 1

This is a policy decision: do you try to recover, raise an exception, 
raise a warning, pad missing lines as blank, throw away excess lines, 
or what?
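
To make that choice concrete, here is a rough sketch of one way to apply such a policy to a single section. The function name apply_policy and the policy names "raise", "warn" and "pad" are made up for this illustration, and it assumes every header has exactly two fields, a name and an integer count:

```python
import warnings

def apply_policy(header, lines, policy="raise"):
    """Reconcile a section's lines with the count declared in its header.

    header is the raw header line, e.g. "headerA 3"; lines is the list of
    data lines found after it. The policy names are invented for this sketch.
    """
    name, declared = header.split()
    declared = int(declared)
    if len(lines) == declared:
        return name, lines
    message = "%s: expected %d lines, found %d" % (name, declared, len(lines))
    if policy == "raise":
        raise ValueError(message)
    if policy == "warn":
        # Report the mismatch but keep whatever lines were found.
        warnings.warn(message)
        return name, lines
    if policy == "pad":
        # Pad short sections with blank lines, throw away excess lines.
        padded = lines + [""] * (declared - len(lines))
        return name, padded[:declared]
    raise ValueError("unknown policy: %r" % policy)
```

With the "pad" policy, a section declared as "headerE 3" that only has one line comes back with two blank lines appended, and a section with too many lines is cut down to the declared count.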

> I would like to turn this file into a dictionary like:
> dict = {'headerA':[line 1, line 2, ... , line n1], 'headerB':[line1,
> line 2, ... , line n2]}
>
> Is there a way to do this with a dictionary comprehension or do I
> have to iterate over the file with a "while 1" loop?

I wouldn't do either. I would treat this as a pipeline problem: you 
have a series of lines that need to be processed. You can feed them 
through a pipeline of filters:

def skip_blanks(lines):
    """Remove leading and trailing whitespace, ignore blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield line

def collate_sections(lines):
    """Yield (header, lines) pairs, one pair per section."""
    current_header = ""
    accumulator = []
    for line in lines:
        if line.startswith("header"):
            if current_header or accumulator:
                # Don't yield an empty pair before the very first header.
                yield (current_header, accumulator)
            current_header = line
            accumulator = []
        else:
            accumulator.append(line)
    yield (current_header, accumulator)

Then put them together like this:

data = {}  # don't shadow the built-in dict
with open("my_file.dat", "r") as fp:  # the file is closed on exit
    non_blank_lines = skip_blanks(fp)
    sections = collate_sections(non_blank_lines)
    for (header, lines) in sections:
        data[header] = lines

Of course you can add your own error checking.
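
As one possible shape for that error checking, the sketch below is an extra filter that sits between the collating step and the loop that fills the dictionary. The name check_sections is hypothetical, and the choice to raise ValueError on every problem is only one of the policies mentioned earlier. It assumes each header line is a name followed by an integer count, and it yields just the name so the keys look like 'headerA':

```python
def check_sections(sections):
    """Yield (name, lines) pairs, raising ValueError on malformed input.

    Hypothetical extra filter. Expects (header, lines) pairs where header
    is the raw header line such as "headerA 2", or "" if there was none.
    """
    seen = set()
    for header, lines in sections:
        if not header:
            if lines:
                # Data lines turned up before any header was seen.
                raise ValueError(
                    "%d line(s) before the first header" % len(lines))
            continue  # empty pair with no header: nothing to report
        fields = header.split()
        if len(fields) != 2 or not fields[1].isdigit():
            raise ValueError("malformed header: %r" % header)
        name = fields[0]
        if name in seen:
            raise ValueError("duplicate header: %r" % name)
        seen.add(name)
        yield name, lines
```

It would slot into the pipeline as sections = check_sections(collate_sections(non_blank_lines)). Because it is a generator, an error is only raised when the loop reaches the bad section.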

-- 
Steven D'Aprano