Newbie programmer question: How do parsers work? (Python examples?)

Fri Aug 25 16:24:22 EDT 2006

You have a lot of choices with this sort of thing.  What you'd use
depends largely on what sorts of files/input you'll be parsing.

For example, a common machine-friendly data format is the
comma-separated file.  These, or really any file that uses a
character-based field separator (including newline characters), are
usually best read with something like split(separator), which returns
a list of the pieces of the string you call it on.  Example:

>>> line = 'Brian,student,California,555-0127'
>>> tokens = line.split(',')
>>> print(tokens)
['Brian', 'student', 'California', '555-0127']

It's more helpful still if the file has a header line:

>>> data = ('Name,Occupation,Location,Phone\n'
...         'Brian,student,California,555-0127\n'
...         'Ann,carpenter,Georgia,555-3825')
>>> entries = data.split('\n')
>>> print(entries)
['Name,Occupation,Location,Phone', 'Brian,student,California,555-0127',
'Ann,carpenter,Georgia,555-3825']
>>> header = entries[0]      # the column names
>>> entries = entries[1:]    # everything after the header
>>> for entry in entries:
...     tokens = entry.split(',')
...     print(tokens)        # or do whatever you need with each row
...
['Brian', 'student', 'California', '555-0127']
['Ann', 'carpenter', 'Georgia', '555-3825']
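
One thing the loop body could do, as a rough sketch: pair each field with
its column name from the header, so every line becomes a dictionary.  The
names parse_table and records are just illustrative, and this assumes no
field contains the separator itself.  For real CSV files with quoted fields
or embedded commas, the standard library's csv module is the safer choice.

```python
def parse_table(text, separator=','):
    """Split text into lines, then pair each line's fields with the header names."""
    lines = text.split('\n')
    header = lines[0].split(separator)
    records = []
    for line in lines[1:]:
        if not line:          # skip blank lines, e.g. from a trailing newline
            continue
        tokens = line.split(separator)
        records.append(dict(zip(header, tokens)))
    return records

data = ('Name,Occupation,Location,Phone\n'
        'Brian,student,California,555-0127\n'
        'Ann,carpenter,Georgia,555-3825')

records = parse_table(data)
for record in records:
    print(record['Name'] + ': ' + record['Phone'])
```

Looking fields up by name (record['Phone']) rather than by position
(tokens[3]) means the code keeps working if the columns are reordered.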

A more powerful tool is the regular expression engine, which is
something you'll be using quite a lot if you get into heavy text
parsing.  Some people describe regular expressions as their own
mini-language, and they are by no means Python-specific: Perl, Java,
various Unix shells, and others all have a roughly equivalent setup.

Python's regular expression engine is very object-oriented.  As a simple
primer: you import the re module, build a pattern object with
re.compile(), and then call one of that object's methods (search(),
match(), findall(), and so on) on the text you want to parse.  A common
example, where you want to find the version of the Globus software
installed by reading a filename:

>>> import re
>>> path = '/users/username/globus/globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so'
>>> version = re.compile(r"\d\.\d\.\d")
>>> print(version)
<_sre.SRE_Pattern object at 0x400329b0>
>>> version.search(path)
<_sre.SRE_Match object at 0x40075218>
>>> version.search(path).group()
'4.0.2'

(The exact way the pattern and match objects print varies between Python
versions.)  This seems a bit overblown just to find the version; after
all, we could have split path on '/' to make a list of tokens, grabbed
token 4, split that on '-', and taken token 1.  The advantage of regular
expressions is that they're very flexible.  The same pattern would work
on any of the following:

/users/username/globus/globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so
../../globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so
ftp ftp.server.com -e 'get globus/globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so'

and so on.  The moral is that when the string you're parsing is fairly
regular, use something like split(); when it can vary a lot, use
regular expressions.  split() is, as you might expect, quite a bit faster.
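
To make the flexibility concrete, here is a small sketch that runs one
pattern over all three strings.  The helper name find_version is made up
for illustration.  Note that search() returns None when nothing matches, so
it's worth checking before calling group().

```python
import re

# Same idea as the pattern above; the dots are escaped here so that
# they only match a literal '.' character.
version = re.compile(r"\d\.\d\.\d")

samples = [
    "/users/username/globus/globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so",
    "../../globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so",
    "ftp ftp.server.com -e 'get globus/globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so'",
]

def find_version(text):
    """Return the first version-like fragment in text, or None if there is none."""
    match = version.search(text)
    if match is None:
        return None
    return match.group()

found = [find_version(sample) for sample in samples]
print(found)
```

Each of the three strings yields '4.0.2', even though the version sits at
a different position in every one.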

I should stress that this is a very barebones example, and doesn't even
begin to scratch the surface of regular expressions' power.  It's also
a little too general, as any string fragment of the form
[digit].[digit].[digit] will match, whether or not it is a version
number.  An excellent resource on regular expressions in Python (I
believe lifted from the original documentation, but I digress):

http://www.amk.ca/python/howto/regex/
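
As for the pattern being too general, one way to tighten it (a sketch, not
the only approach) is to require the literal prefix 'globus-' and capture
the numbers after it.  This assumes the version always follows 'globus-',
which holds for the examples above but may not for every install layout.
The names globus_version and parse_version are hypothetical.

```python
import re

# Require the literal prefix 'globus-', then capture digits.digits.digits.
# \d+ also allows multi-digit parts such as 4.0.10.
globus_version = re.compile(r"globus-(\d+)\.(\d+)\.(\d+)")

def parse_version(text):
    """Return (major, minor, patch) as integers, or None if no version is found."""
    match = globus_version.search(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())

print(parse_version("../../globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so"))
print(parse_version("connect to 10.0.0.1/globus/lib"))
```

The second string contains an IP address, which the looser
digit.digit.digit pattern would partly match; the tighter pattern finds
nothing there and returns None.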

XML is another common format to have to go through, but I don't have
much experience in this area.  Python's standard library does include
XML parsers in its xml package; xml.dom.minidom, for example, turns an
XML file into a tree of nested node objects that you can walk (not quite
a dictionary, but a similar multi-level idea).  Hopefully others can
fill in on that part.
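
To fill in a little here, a minimal sketch using xml.etree.ElementTree,
which ships with current Python versions.  It parses XML into a tree of
elements, and you can build your own dictionaries from that if you want
them.  The XML snippet and its tag names are made up for illustration,
mirroring the comma-separated example earlier.

```python
import xml.etree.ElementTree as ET

# Made-up sample data; the tag and attribute names are assumptions.
xml_text = """<people>
  <person name="Brian">
    <occupation>student</occupation>
    <location>California</location>
  </person>
  <person name="Ann">
    <occupation>carpenter</occupation>
    <location>Georgia</location>
  </person>
</people>"""

root = ET.fromstring(xml_text)   # root is the <people> element

people = []
for person in root.findall('person'):
    people.append({
        'name': person.get('name'),                    # attribute value
        'occupation': person.findtext('occupation'),   # text of a child element
        'location': person.findtext('location'),
    })

print(people)
```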

Also, don't be afraid of having the interpreter open next to your
editor of choice, and of running test patterns through any parsing code
you're writing.  Regular expressions in particular are very easy to
screw up, no matter how long you've been using them.

bio_enthusiast wrote:
> I was wondering exactly how you create a parser. I'm learning
> Python and I recently have come across this material. I'm interested
> in the method or art of writing a parser.
>
> If anyone has some python code to post for an abstract parser, or links
> to some informative tutorials, that would be great.