File parser

Tue Aug 30 11:40:11 EDT 2005

Angelic Devil wrote:
> I'm building a file parser but I have a problem I'm not sure how to
> solve.  The files this will parse have the potential to be huge
> (multiple GBs).  There are distinct sections of the file that I
> want to read into separate dictionaries to perform different
> operations on.  Each section has specific begin and end statements
> like the following:
>
> KEYWORD
> .
> .
> .
> END KEYWORD
>
> The very first thing I do is read the entire file contents into a
> string.  I then store the contents in a list, splitting on line ends
> as follows:
>
>
>     file_lines = file_contents.split('\n')
>
>
> Next, I build smaller lists from the different sections using the
> begin and end keywords:
>
>
>     begin_index = file_lines.index(begin_keyword)
>     end_index = file_lines.index(end_keyword)
>     small_list = file_lines[begin_index + 1 : end_index]
>
>
> I then plan on parsing each list to build the different dictionaries.
> The problem is that one begin statement is a substring of another
> begin statement as in the following example:
>
>
> BAR
> END BAR
>
> FOOBAR
> END FOOBAR
>
>
> I can't just look for the line in the list that contains BAR because
> FOOBAR might come first in the list.  My list would then look like
>
> [foobar_1, foobar_2, ..., foobar_n, ..., bar_1, bar_2, ..., bar_m]
>
> I don't really want to use regular expressions, but I don't see a way
> to get around this without doing so.  Does anyone have any suggestions
> on how to accomplish this? If regexps are the way to go, is there an
> efficient way to parse the contents of a potentially large list using
> regular expressions?
>
> Any help is appreciated!
>
> Thanks,
> Aaron

Some time ago I was toying around with writing a tool in Python to
parse our VB6 code (the original idea was to write our own .NET
conversion tool because the Wizard that comes with VS.NET sucks hard on
some things).  I tried various parsing tools and EBNF grammars, but VB6
syntax doesn't fit an EBNF grammar in all cases, so I needed something
else.

VB6 syntax is similar to what you have, with all kinds of different
"Begin/End" blocks, and some files can be rather big.  Also, when you
get to conditionals and looping constructs the logic can be deeply
nested, so the approach I took was to imitate a SAX parser.  I created
a class that reads VB6 source line by line and calls empty "event
handler" methods (just like SAX) such as self.begin_type or
self.begin_procedure and self.end_type or self.end_procedure.  Then I
created a subclass that actually implemented those event handlers by
building a sort of tree that represents the program in a more abstract
fashion.

I never got to the point of writing the tree out in a new language, but
I had fun hacking on the project for a while.  I think a similar
approach could work for you here: because the parser reads one line at
a time, it never needs the whole file in memory.
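
To give a rough idea of the shape, here is a minimal sketch of that
SAX-style approach applied to KEYWORD / END KEYWORD sections.  It is not
my original VB6 code.  The class names (SectionParser, DictBuilder), the
handler names, and the "key value" line layout inside each section are
all made up for illustration, so adjust them to your real format.
Keywords are compared against the whole stripped line, which is what
keeps BAR from matching inside FOOBAR without any regexps.

```python
import io


class SectionParser:
    """Reads lines one at a time and fires SAX-style events."""

    def __init__(self, keywords):
        # Keywords are matched against the whole stripped line,
        # so "BAR" never matches inside "FOOBAR".
        self.keywords = set(keywords)
        self.current = None  # name of the section we are inside, if any

    def parse(self, lines):
        # `lines` can be an open file object; iterating it reads
        # one line at a time instead of loading the whole file.
        for raw in lines:
            line = raw.strip()
            if self.current is None:
                if line in self.keywords:
                    self.current = line
                    self.begin_section(line)
            elif line == "END " + self.current:
                self.end_section(self.current)
                self.current = None
            else:
                self.section_line(self.current, line)

    # Empty event handlers; subclasses override the ones they need.
    def begin_section(self, name):
        pass

    def section_line(self, name, line):
        pass

    def end_section(self, name):
        pass


class DictBuilder(SectionParser):
    """Builds one dictionary per section (hypothetical 'key value' lines)."""

    def __init__(self, keywords):
        super().__init__(keywords)
        self.sections = {}

    def begin_section(self, name):
        self.sections[name] = {}

    def section_line(self, name, line):
        if not line:
            return  # skip blank lines inside a section
        key, _, value = line.partition(" ")
        self.sections[name][key] = value.strip()


# Small in-memory stand-in for a real file.
sample = io.StringIO("FOOBAR\nx 1\nEND FOOBAR\nBAR\ny 2\nz 3\nEND BAR\n")
builder = DictBuilder(["BAR", "FOOBAR"])
builder.parse(sample)
```

For a real file you would pass the open file object to parse() instead
of the StringIO.  If the per-section dictionaries themselves get too
big, end_section is the natural place to process one and throw it away
before the next section starts.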