Saving search results in a dictionary - why is pyparsing better than regexp?

Paul McGuire ptmcg at austin.rr._bogus_.com
Fri Jun 18 17:50:01 EDT 2004


"Lukas Holcik" <xholcik1 at fi.muni.cz> wrote in message
news:Pine.LNX.4.60.0406181448190.19022 at nymfe30.fi.muni.cz...
> Hi Paul, and thanks for your reply.
>
> Why is the pyparsing module better than re? Just a question I must ask
> before I can use it, meant with no offense. I found a PDF howto about
> regexps on python.org, and learned that there is a method called
> finditer which could accomplish this task quite easily:
>
>      regexp = re.compile("<a href=\"(?P<href>.*?)\">(?P<pcdata>.*?)</a>",
>                          re.I)
>      iterator = regexp.finditer(text)
>      for match in iterator:
>          dict[match.group("pcdata")] = match.group("href")
>
> ---------------------------------------_.)--
> |  Lukas Holcik (xholcik1 at fi.muni.cz)  (\=)*
> ----------------------------------------''--
>
<snip>

Lukas -

A reasonable question, no offense taken. :)

Well, I'm not sure I'd say pyparsing was "better" than re - maybe "easier"
or "easier to read" or "easier to maintain" or "easier for those who don't
do regexp's frequently enough to have all the re symbols memorized".  And
I'd be the first to admit that pyparsing is slow at runtime.  I would also
tell you that I am far from being a regexp expert, having had to delve into
them only 3 or 4 times in the past 10 years (and consequently re-learn them
each time).

On the other hand, pyparsing does do some things to simplify your life.  For
instance, there are a number of valid HTML anchors that the re you listed
above would miss.  First of all, whitespace has a funny way of cropping up
in unexpected places, such as between 'href' and '=', or between '=' and the
leading ", or in the closing /a tag as "< /a >".  What often starts out as a
fairly clean-looking regexp such as you posted quickly becomes mangled with
markers for optional whitespace.  (As far as I know, there is no magic re
flag for this, either; re.VERBOSE only ignores whitespace in the pattern
itself, not in the input, so you end up sprinkling \s* between the tokens
by hand.)
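
For illustration, here is roughly what that same pattern looks like once
the optional-whitespace markers go in (an untested sketch; attribute
order, unquoted values, and comments are still not handled):

    import re

    # the original pattern with \s* added everywhere whitespace can
    # legally appear; still assumes a quoted href and no other attributes
    regexp = re.compile(r'<\s*a\s+href\s*=\s*"(?P<href>.*?)"\s*>'
                        r'(?P<pcdata>.*?)<\s*/\s*a\s*>', re.I)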

Comments are another element that can confound well-intentioned regexp
efforts.  The pyparsing example that I gave earlier did not handle HTML
comments, but to do so, you would define an HTML comment element, and
then add the statement:
    link.ignore( htmlComment )
(pyparsing includes a comment definition for C-style block comments of
the /* */ variety - maybe adding an HTML comment definition would be
useful?)  What would the above regexp look like to handle embedded HTML
comments?
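
For concreteness, here is a sketch of the pyparsing side of that (the
htmlComment expression is hand-rolled, and I'm assuming the link
expression from the earlier example is still in scope):

    from pyparsing import Literal, SkipTo

    # a hand-rolled HTML comment expression: "<!--" up to and
    # including the next "-->"
    htmlComment = Literal("<!--") + SkipTo(Literal("-->"), include=True)

    # tell the link expression to skip comments wherever they appear
    link.ignore(htmlComment)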

In the sample I posted earlier, extracting URL refs from www.yahoo.com, a
number of href's were *not* inside quoted strings - how quickly could the
above regexp be modified to handle this?
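
One way might be an alternation that makes the quotes optional, along the
lines of this untested sketch (at a further cost in readability):

    import re

    # quoted href, or a bare run of characters up to whitespace or '>';
    # exactly one of the two named groups will be non-None per match
    regexp = re.compile(r'<\s*a\s+href\s*=\s*'
                        r'(?:"(?P<href>[^"]*)"|(?P<href_unq>[^\s>]+))'
                        r'\s*>(?P<pcdata>.*?)<\s*/\s*a\s*>', re.I)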

And what about anchors that wrap across two lines?  If I remember right,
'.' matches spaces but not newlines unless re.DOTALL is given, so hrefs
and pcdata with embedded spaces are handled, but a multi-line anchor
slips right through.  (Sorry if my re ignorance is showing here.)

Lastly, pyparsing does handle some problems that regexp's can't, most
notably those that have some recursive definition, such as algebraic infix
notation or EBNF.  Working samples of both of these are included in the
samples that come with pyparsing.  (There are other parsers besides
pyparsing that can do this same job, too.)
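
As a taste of the recursive case, here is a minimal sketch of my own (not
one of the shipped samples): a grammar for nested parenthesized sums, the
kind of self-referencing structure a single regexp can't express.  With a
recent pyparsing:

    from pyparsing import (Forward, Group, Literal, Suppress, Word,
                           ZeroOrMore, nums)

    expr = Forward()            # forward-declare the recursive rule
    atom = Word(nums) | Group(Suppress("(") + expr + Suppress(")"))
    expr <<= atom + ZeroOrMore(Literal("+") + atom)

    print(expr.parseString("1+(2+(3+4))"))
    # -> ['1', '+', ['2', '+', ['3', '+', '4']]]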

pyparsing's runtime performance is pretty slow, positively glacial compared
to compiled regexp's or string splits.  I've warned away some potential
pyparsing users who had *very* clean input data (no hand-edited input text,
very stable and simple input format) and whose string split() solution ran
50X faster than pyparsing.  That case was a good exercise for me: using the
hotshot profiler, I trimmed 30-40% off pyparsing's runtime, but I was still
far shy of the much-speedier string-splitting algorithm.  But again, that
application had *very* clean input data with a straightforward format, and
a very demanding runtime performance requirement: it had to read and
process about 50,000 data records at startup, for which string.split() took
about 0.08 seconds and pyparsing took about 5 seconds.  My recommendation
was to *not* use pyparsing in this case.
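
If you want to see the gap on your own machine, a quick-and-dirty timing
along these lines (hypothetical stand-in data, not the user's actual
records) makes the point:

    import time
    from pyparsing import Word, alphanums, delimitedList

    line = "alpha,bravo,charlie,delta"    # stand-in for one clean record
    record = delimitedList(Word(alphanums))

    t0 = time.time()
    for _ in range(50000):
        line.split(",")
    print("str.split: %.2f sec" % (time.time() - t0))

    t0 = time.time()
    for _ in range(50000):
        record.parseString(line)
    print("pyparsing: %.2f sec" % (time.time() - t0))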

On the other hand, for simple one-off's, or for functions that are not time-
critical parts of a program, or if it doesn't matter whether the program
takes 10 minutes to write and 30 seconds to run (with, say, pyparsing) vs.
15 minutes to write and 0.5 seconds to run (with, say, regexp's), I'd say
pyparsing is a good choice.  And when you find you need to add to or extend
a given parsing construct, it is usually a very straightforward process
with pyparsing.

I've had a number of e-mails telling me how pleasant and intuitive it is to
work with pyparsing, in some ways reminiscent of the "I like coding in
Python, even if it is slower than C at runtime" comments we read in c.l.py
every week (along with many expositions on how raw runtime performance
is not always the best indicator of what solution is the "best").

Just as David Mertz describes in his book Text Processing in Python,
each of these is just one of many tools in our toolkits.  Don't make your
solution any more complicated than it needs to be.  The person who, in 6
months, has to figure out just how the heck your code works just might
be you.

Sorry for the length of response, hope some of you are still awake...

-- Paul