Parsing a file with iterators

George Sakkis george.sakkis at gmail.com
Sat Oct 18 00:18:41 EDT 2008


On Oct 17, 12:45 pm, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
> On Fri, 17 Oct 2008 11:42:05 -0400, Luis Zarrabeitia wrote:
> > I need to parse a file, text file. The format is something like that:
>
> > TYPE1 metadata
> > data line 1
> > data line 2
> > ...
> > data line N
> > TYPE2 metadata
> > data line 1
> > ...
> > TYPE3 metadata
> > ...
> > […]
> > because when the parser iterates over the input, it can't know that it
> > finished processing the section until it reads the next "TYPE" line
> > (actually, until it reads the first line that it cannot parse, which if
> > everything went well, should be the 'TYPE'), but once it reads it, it is
> > no longer available to the outer loop. I wouldn't like to leak the
> > internals of the parsers to the outside.
>
> > What could I do?
> > (to the curious: the format is a dialect of the E00 used in GIS)
>
> Group the lines before processing and feed each group to the right parser:
>
> import sys
> from itertools import groupby, imap
> from operator import itemgetter
>
> def parse_a(metadata, lines):
>     print 'parser a', metadata
>     for line in lines:
>         print 'a', line
>
> def parse_b(metadata, lines):
>     print 'parser b', metadata
>     for line in lines:
>         print 'b', line
>
> def parse_c(metadata, lines):
>     print 'parser c', metadata
>     for line in lines:
>         print 'c', line
>
> def test_for_type(line):
>     return line.startswith('TYPE')
>
> def parse(lines):
>     def tag():
>         type_line = None
>         for line in lines:
>             if test_for_type(line):
>                 type_line = line
>             else:
>                 yield (type_line, line)
>
>     type2parser = {'TYPE1': parse_a,
>                    'TYPE2': parse_b,
>                    'TYPE3': parse_c }
>
>     for type_line, group in groupby(tag(), itemgetter(0)):
>         type_id, metadata = type_line.split(' ', 1)
>         type2parser[type_id](metadata, imap(itemgetter(1), group))
>
> def main():
>     parse(sys.stdin)

I like groupby and find it very powerful, but I think it complicates
things here instead of simplifying them. I would instead create a
parser instance for each section as soon as its TYPE line is read and
then feed it one data line at a time. (If all the data lines must or
should be given at once, append them to a list and feed the whole list
as soon as the next section, or the end of the input, is found.)
Something like:

class parse_a(object):
    def __init__(self, metadata):
        print 'parser a', metadata
    def parse(self, line):
        print 'a', line

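# similar for parse_b and parse_c
# ...

type2parser = {'TYPE1': parse_a,
               'TYPE2': parse_b,
               'TYPE3': parse_c}

def parse(lines):
    feed = None  # bound parse() method of the current section's parser
    for line in lines:
        if test_for_type(line):
            type_id, metadata = line.split(' ', 1)
            feed = type2parser[type_id](metadata).parse
        elif feed is None:
            raise ValueError('data line before any TYPE line: %r' % line)
        else:
            feed(line)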

George


