File parser

Mon Aug 29 22:20:27 EDT 2005

It's not clear to me from your posting what order the tags may appear
in. Assuming you will always END a section before beginning a new one,
i.e. it's always:

A
 some A-section lines.
END A

B
some B-section lines.
END B

etc.

And never:

A
 some A-section lines.
B
some B-section lines.
END B
END A

etc.

it should be fairly simple. And if the file is several GB, you ought
to use a generator that yields one section at a time, so the whole file
never has to be held in memory.

Something like this:

def make_tag_lookup(begin_tags):
  # create a dict with each {begin_tag : end_tag}
  end_tags = [('END ' + begin_tag) for begin_tag in begin_tags]
  return dict(zip(begin_tags, end_tags))

def return_sections(filepath, lookup):
  # Generator returning each section

  inside_section = False

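  # iterate over the file object directly; readlines() would load the
  # whole file into memory and defeat the purpose of the generator
  for line in open(filepath, 'r'):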
    line = line.strip()
    if not inside_section:
      if line in lookup:
        inside_section = True
        data_section = []
        section_end_tag = lookup[line]
        section_begin_tag = line
        data_section.append(line) # store section start tag
    else:
      if line == section_end_tag:
        data_section.append(line) # store section end tag
        inside_section = False
        yield data_section # yield entire section

      else:
        data_section.append(line) #store each line within section

# create the generator yielding each section
# (datafile is the path to your file, list_of_begin_tags is e.g. ['A', 'B'])
sections = return_sections(datafile, make_tag_lookup(list_of_begin_tags))

for section in sections:
  for line in section:
    print(line)
  print('\n')
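
If you want to try the idea end to end, here is a minimal self-contained
sketch of the same approach. It is only an illustration under the
assumption stated above (sections never nest): the name iter_sections,
the sample data and the tags 'A' and 'B' are made up for the demo, and
the ValueError checks are my own addition to flag files that break the
assumption.

```python
import os
import tempfile


def make_tag_lookup(begin_tags):
    # map each begin tag to its end tag, e.g. {'A': 'END A'}
    return {tag: 'END ' + tag for tag in begin_tags}


def iter_sections(filepath, lookup):
    # generator: yields one section (a list of stripped lines) at a time
    section = None  # None means we are currently outside any section
    end_tag = None
    with open(filepath, 'r') as handle:
        for line in handle:  # the file is read lazily, line by line
            line = line.strip()
            if section is None:
                if line in lookup:
                    section = [line]  # store section start tag
                    end_tag = lookup[line]
            elif line == end_tag:
                section.append(line)  # store section end tag
                yield section
                section = None
            elif line in lookup:
                # assumption violated: a new section began before the END tag
                raise ValueError('nested section %r inside %r'
                                 % (line, section[0]))
            else:
                section.append(line)  # store each line within the section
    if section is not None:
        raise ValueError('missing %r at end of file' % end_tag)


# hypothetical sample data, written to a temporary file for the demo
sample = ('A\n some A-section lines.\nEND A\n\n'
          'B\nsome B-section lines.\nEND B\n')
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)

try:
    sections = list(iter_sections(tmp.name, make_tag_lookup(['A', 'B'])))
finally:
    os.remove(tmp.name)

for section in sections:
    print('\n'.join(section))
    print()
```

For a really large file you would loop over iter_sections(...) directly
instead of wrapping it in list(), so that only one section is in memory
at any time.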