ANN: Martel-0.2

Thu Aug 24 20:40:18 EDT 2000

[Fredrik Lundh, Greg Ewing and Marc-Andre Lemburg may be interested
 in this post because of what I've done with their ideas and code.]

Hello,

  This is my first announcement on this newsgroup for a project I've
been working on called Martel.  It's a parser generator for (nearly)
regular languages.  Detailed information can be found in my recent
conference poster under http://www.biopython.org/~dalke/Martel/

  It is designed to handle many of the formats I need to parse --
database records and program output -- which are stateful but not
very complex.  They are stateful, meaning that if the parsing were split
up between a lexer and a parser, there would be a lot of
communication between them so the lexer could tokenize correctly.
They aren't complex, meaning they don't have balanced parens or other
types of data structures with indefinite depth.

  Briefly, it uses a modified subset of Python's re expression
syntax as the format description, and uses it to build the parser.
The parser takes an input string and makes an expression tree for
it.  The tree is passed back to the caller using the SAX events
from XML parsing, where the (?P<named>groups) are used to define
the startElement() and endElement() names, and the leaves of the
tree become the characters().  I need a briefer description than
this, but haven't figured one out which is still understandable.
I'm not even sure that's understandable :)
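
  To make the mapping from named groups to SAX events a bit more
concrete, here is a minimal sketch that uses only the standard re and
xml.sax modules.  It is not Martel's implementation: emit_events and
Recorder are hypothetical names made up for this illustration, and the
sketch only copes with named groups that are neither nested nor
repeated.

```python
import re
from xml.sax.handler import ContentHandler

class Recorder(ContentHandler):
    """Hypothetical handler that stores each SAX-style call as a tuple."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []
    def startElement(self, name, attrs):
        self.events.append(("start", name))
    def endElement(self, name):
        self.events.append(("end", name))
    def characters(self, content):
        self.events.append(("chars", content))

def emit_events(pattern, text, handler):
    """Match text and report each named group as an element.

    Assumption: named groups are flat (not nested, not repeated).
    """
    match = re.match(pattern, text)
    if match is None:
        raise ValueError("text does not match the format")
    # Map group index -> name, then order the groups by where they matched.
    names = {index: name for name, index in match.re.groupindex.items()}
    spans = sorted((match.span(i), names[i]) for i in names
                   if match.span(i) != (-1, -1))
    pos = match.start()
    for (start, end), name in spans:
        if start > pos:
            handler.characters(text[pos:start])   # text between groups
        handler.startElement(name, {})
        if end > start:
            handler.characters(text[start:end])   # the group's own text
        handler.endElement(name)
        pos = end
    if match.end() > pos:
        handler.characters(text[pos:match.end()])  # trailing text

recorder = Recorder()
emit_events(r"(?P<name>[^:\n]*):(?P<uid>\d+)\n", "root:0\n", recorder)
for event in recorder.events:
    print(event)
```

  Text that falls outside every named group (the ":" and the newline
here) is still reported through characters(), so the handler sees the
whole input.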

  Technically, the regular expression parsing is done with a modified
version of Fredrik Lundh's sre_parse from Python 1.6a3 (or 2?).  The
modifications allow access to all groups with the same name (instead
of just the last one) and allow group name identifiers to have the
same syntax as XML tag names.  I also added support for a new
syntax which I call "named group repeats", where the '{}'s allow a group
name inside; the number matched by that group is used as the repeat
count.  For example:
 r"Num atoms = (?P<num_atoms>\d+)\n((?P<atom_name>\w+)\n){num_atoms}"
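
  The standard re module has no such syntax, so as a rough
illustration of what a named group repeat means, the same record can
be read in two passes: match the count first, then build an ordinary
{n} repeat from it.  This is only a sketch of the semantics, not how
Martel does it, and parse_atoms is a hypothetical helper name.

```python
import re

def parse_atoms(text):
    """Two-pass stand-in for a repeat count taken from a named group."""
    header = re.match(r"Num atoms = (?P<num_atoms>\d+)\n", text)
    if header is None:
        raise ValueError("missing 'Num atoms' header")
    count = int(header.group("num_atoms"))
    # Now that the count is known, it can go into an ordinary {n} repeat.
    body = re.compile(r"(?:\w+\n){%d}" % count)
    block = body.match(text, header.end())
    if block is None:
        raise ValueError("expected %d atom lines" % count)
    # Each atom name sits on its own line, so splitting on whitespace works.
    return block.group(0).split()

atoms = parse_atoms("Num atoms = 3\nC\nN\nO\nextra\n")
print(atoms)
```

  With the count built into the repeat, the line "extra" is left
unconsumed rather than being mistaken for a fourth atom.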

  Building up regular expressions as strings is error-prone,
so I convert the parsed regular expression into an Expression tree.
Expressions can be combined and otherwise manipulated using many of
the same functions as Greg Ewing's Plex.
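
  As a rough idea of why combining objects is less error-prone than
pasting strings together, here is a toy sketch of the approach.  The
class names (Expr, Literal, Raw, Named, Seq, Either) are invented for
this illustration and are neither Martel's nor Plex's; each node
simply renders itself as a standard re pattern.

```python
import re

class Expr:
    """Base class: every node can render itself as a regex string."""
    def __add__(self, other):
        return Seq(self, other)          # "+" means "followed by"
    def pattern(self):
        raise NotImplementedError

class Literal(Expr):
    def __init__(self, text):
        self.text = text
    def pattern(self):
        return re.escape(self.text)      # escaping happens in one place

class Raw(Expr):
    def __init__(self, regex):
        self.regex = regex
    def pattern(self):
        return "(?:%s)" % self.regex     # grouped so a "|" cannot leak out

class Named(Expr):
    def __init__(self, name, expr):
        self.name, self.expr = name, expr
    def pattern(self):
        return "(?P<%s>%s)" % (self.name, self.expr.pattern())

class Seq(Expr):
    def __init__(self, *parts):
        self.parts = parts
    def pattern(self):
        return "".join(p.pattern() for p in self.parts)

class Either(Expr):
    def __init__(self, *choices):
        self.choices = choices
    def pattern(self):
        return "(?:%s)" % "|".join(c.pattern() for c in self.choices)

line = (Named("key", Raw(r"\w+")) + Literal(" = ") +
        Named("value", Either(Raw(r"\d+"), Literal("n/a"))))
match = re.match(line.pattern(), "atoms = 42")
print(match.group("key"), match.group("value"))
```

  Because grouping and escaping are handled inside the nodes, a
fragment can be reused in a larger format without worrying about
operator precedence in the final string.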

  The parser is built by making a tag table for Marc-Andre Lemburg's
mxTextTools, which does the actual parsing.  I did have to add some
hacks for things like lookahead assertions and named group references.
(The last, for example, doesn't support multiple threads.)

  The resulting system is very cool, if I say so myself :)  Hmm,
looks like I need an example.  If you know the SWISS-PROT format, then
the examples on my aforementioned poster should be helpful.  I guess
I should come up with one which is a little less domain specific.
Umm, how about the following untested code:

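from Martel import *

# A named field: any run of characters up to the next ":" or newline.
def word(name):
  return Group(name, Re("[^:\n]*"))

# One record per line.  The password field is tagged "no_passwd" when it
# does not start with 13 word characters (i.e. is not a crypt-style hash).
format = Rep(word("name") + Str(":") + \
             Alt( Group("no_passwd", Re(r"(?!\w{13})[^:]")),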
                  word("passwd")
                ) + Str(":") + \
             word("uid") + Str(":") + \
             word("gid") + Str(":") + \
             word("homedir") + Str(":") + \
             word("shell") + Str("\n")
            )

Making the parser, tying in the handler(s), and then parsing the string
root:gEMPivloT9av8:0:0:System Account:/root:/bin/sh
idle:x:10:66:Eric Idle (disabled):/home/idle:/bin/noshell

gives calls like:
startElement("name")
 characters("root")
endElement("name")
characters(":")
startElement("passwd")
 characters("gEMPivloT9av8")
endElement("passwd")
 ...
characters("\n")
startElement("name")
 characters("idle")
endElement("name")
characters(":")
startElement("no_passwd")
 characters("x")
endElement("no_passwd")
 ...

But that's not very interesting, since it doesn't show that I can parse
optional fields, and even do format and version detection.  Oh well,
you should learn more about bioinformatics anyway!

                    Andrew
                    dalke at acm.org