Saving search results in a dictionary - why is pyparsing better than regexp?

Jean Brouwers mrjean1 at comcast.net
Fri Jun 18 19:41:28 EDT 2004


If you need a fast parser in Python, try SimpleParse (mxTextTools).  We
use it to parse and process large log files, 100+MB in size.

<http://members.rogers.com/mcfletch/programming/simpleparse/simpleparse.html>

As an example, the run time for the parsing step alone, with a simple but
non-trivial grammar, is comparable to grep.  Total run time is dominated
by the processing step and increases for more complex grammars,
obviously.

/Jean Brouwers
 ProphICy Semiconductor, Inc.


In article <dsJAc.1740$1L2.1400 at fe1.texas.rr.com>, Paul McGuire
<ptmcg at austin.rr._bogus_.com> wrote:

> "Lukas Holcik" <xholcik1 at fi.muni.cz> wrote in message
> news:Pine.LNX.4.60.0406181448190.19022 at nymfe30.fi.muni.cz...
> > Hi Paul and thanks for reply,
> >
> > Why is the pyparsing module better than re? Just a question I must ask
> > before I can use it. Meant with no offense. I found an extra pdf howto on
> > python.org about regexps and found out, that there is an object called
> > finditer, which could accomplish this task quite easily:
> >
> >      regexp = re.compile("<a href=\"(?P<href>.*?)\">(?P<pcdata>.*?)</a>",
> >                          re.I)
> >      iterator = regexp.finditer(text)
> >      for match in iterator:
> >          dict[match.group("pcdata")] = match.group("href")
> >
> > ---------------------------------------_.)--
> > |  Lukas Holcik (xholcik1 at fi.muni.cz)  (\=)*
> > ----------------------------------------''--
> >
> <snip>
> 
> Lukas -
> 
> A reasonable question, no offense taken. :)
> 
> Well, I'm not sure I'd say pyparsing was "better" than re - maybe "easier"
> or "easier to read" or "easier to maintain" or "easier for those who don't
> do regexp's frequently enough to have all the re symbols memorized".  And
> I'd be the first to admit that pyparsing is slow at runtime.  I would also
> tell you that I am far from being a regexp expert, having had to delve into
> them only 3 or 4 times in the past 10 years (and consequently re-learn them
> each time).
> 
> On the other hand, pyparsing does do some things to simplify your life.  For
> instance, there are a number of valid HTML anchors that the re you listed
> above would miss.  First of all, whitespace has a funny way of cropping up
> in unexpected places, such as between 'href' and '=', or between '=' and the
> leading ", or in the closing /a tag as "< /a >".  What often starts out as a
> fairly clean-looking regexp such as you posted quickly becomes mangled with
> markers for optional whitespace.  (Although I guess there *is* a magic
> re tag to indicate that whitespace between tokens may or may not be
> there...)
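
A minimal sketch of the whitespace point above, using only the standard re
module.  The sample text and the names anchor and links are illustrative
assumptions, not taken from the thread.  Note that re.VERBOSE only lets
the pattern itself be laid out with whitespace and comments; optional
whitespace in the input still has to be spelled out with \s*.

```python
import re

# Hypothetical sample input; the odd spacing is deliberate.
text = '<a href = "http://www.python.org/" >Python</a> <A HREF="/docs">Docs< /a >'

# Each \s* marks a place where whitespace may or may not appear in the input.
anchor = re.compile(r"""
    <\s*a\s+href\s*=\s*"(?P<href>.*?)"\s*>   # opening tag
    (?P<pcdata>.*?)                          # link text
    <\s*/\s*a\s*>                            # closing tag
    """, re.IGNORECASE | re.VERBOSE | re.DOTALL)

links = {}
for match in anchor.finditer(text):
    links[match.group("pcdata")] = match.group("href")

print(links)
```

The original one-line regexp finds neither of these two anchors, which is
roughly the maintenance cost being described.
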
> 
> Comments are another element that can confound well-intentioned regexp
> efforts.  The pyparsing example that I gave earlier did not handle HTML
> comments, but to do so, you would define an HTML comment element, and
> then add the statement:
>     link.ignore( htmlComment )
> (pyparsing includes a comment definition for C-style block comments of
> the /* */ variety - maybe adding an HTML comment definition would be
> useful?)  What would the above regexp look like to handle embedded HTML
> comments?
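
One possible answer with re alone, offered as a sketch rather than a
complete solution: strip the comments in a first pass, then run the anchor
regexp on what is left.  The sample text and variable names are
assumptions, and this does not try to handle comment-like text inside
scripts or attribute values.

```python
import re

# Hypothetical sample: the first anchor is commented out and should be skipped.
text = '<!-- <a href="/old">Old</a> --><a href="/new">New</a>'

# DOTALL lets a single comment span several lines.
html_comment = re.compile(r"<!--.*?-->", re.DOTALL)
anchor = re.compile(r'<a href="(?P<href>.*?)">(?P<pcdata>.*?)</a>', re.IGNORECASE)

# Pass 1 removes comments, pass 2 collects the remaining anchors.
visible = html_comment.sub("", text)
links = {m.group("pcdata"): m.group("href") for m in anchor.finditer(visible)}

print(links)
```
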
> 
> In the sample I posted earlier, extracting URL refs from www.yahoo.com, a
> number of href's were *not* inside quoted strings - how quickly could the
> above regexp be modified to handle this?
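
A sketch of one way the regexp could be extended for unquoted href values,
using an alternation between a quoted and a bare form.  The group names
quoted and bare and the sample text are assumptions made for illustration.

```python
import re

# Hypothetical sample: one bare href, one quoted href containing a space.
text = '<a href=/index.html>Home</a> <a href="/a b.html">Spaced</a>'

anchor = re.compile(
    r'<a\s+href\s*=\s*'
    r'(?:"(?P<quoted>[^"]*)"|(?P<bare>[^\s>]+))'   # quoted or bare value
    r'\s*>(?P<pcdata>.*?)</a>',
    re.IGNORECASE | re.DOTALL)

links = {}
for m in anchor.finditer(text):
    # Exactly one of the two alternatives matched; take whichever did.
    href = m.group("quoted") if m.group("quoted") is not None else m.group("bare")
    links[m.group("pcdata")] = href

print(links)
```
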
> 
> Doesn't the .* only match non-white characters?  Does the above regexp
> handle hrefs that are quoted strings with embedded spaces?  What about
> pcdata with embedded spaces?  (Sorry if my re ignorance is showing here.)
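
For reference, "." in Python's re matches any character except a newline,
so spaces and tabs are included; with re.DOTALL it matches newlines too.
A short check, with made-up sample strings:

```python
import re

sample = "two words\tand a tab\nsecond line"

# "." matches spaces and tabs, but by default stops at a newline.
first_line = re.match(r".*", sample).group()

# With re.DOTALL, "." also matches the newline.
everything = re.match(r".*", sample, re.DOTALL).group()

# The anchor regexp from the post, applied to values with embedded spaces.
anchor = re.compile(r'<a href="(?P<href>.*?)">(?P<pcdata>.*?)</a>', re.I)
m = anchor.search('<a href="my page.html">My home page</a>')

print(repr(first_line))
print(m.group("href"), "|", m.group("pcdata"))
```

So the posted regexp does handle embedded spaces in both href and pcdata,
as long as neither contains a line break.
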
> 
> Lastly, pyparsing does handle some problems that regexp's can't, most
> notably those that have some recursive definition, such as algebraic infix
> notation, or EBNF.  Working samples of both of these are included in the
> samples that come with pyparsing.  (There are parsers other than
> pyparsing, too, that can do this same job.)
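
To illustrate what "recursive definition" means here, below is a small
hand-written recursive-descent evaluator for infix arithmetic.  It is not
pyparsing and not one of the samples shipped with it; it is only a sketch
with assumed function names, showing that an atom can contain a whole
expression in parentheses, which is the part a single regexp cannot
express for arbitrary nesting depth.

```python
import re

# A token is a number or a single operator/parenthesis, after optional spaces.
TOKEN = re.compile(r"\s*(\d+\.?\d*|[-+*/()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_expr(tokens):
    # expr := term (("+" | "-") term)*
    value = parse_term(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op = tokens.pop(0)
        rhs = parse_term(tokens)
        value = value + rhs if op == "+" else value - rhs
    return value

def parse_term(tokens):
    # term := atom (("*" | "/") atom)*
    value = parse_atom(tokens)
    while tokens and tokens[0] in ("*", "/"):
        op = tokens.pop(0)
        rhs = parse_atom(tokens)
        value = value * rhs if op == "*" else value / rhs
    return value

def parse_atom(tokens):
    # atom := number | "(" expr ")"   <- the recursive step
    if not tokens:
        raise ValueError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        value = parse_expr(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise ValueError("missing closing parenthesis")
        return value
    return float(tok)  # raises ValueError for a stray operator

def evaluate(text):
    tokens = tokenize(text)
    value = parse_expr(tokens)
    if tokens:
        raise ValueError("unexpected token %r" % tokens[0])
    return value

print(evaluate("2 * (3 + 4) - 10 / 5"))
```
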
> 
> pyparsing's runtime performance is pretty slow, positively glacial compared
> to compiled regexp's or string splits.  I've warned away some potential
> pyparsing users who had *very* clean input data (no hand-edited input text,
> very stable and simple input format) that used string split() to run 50X
> faster than pyparsing.  This was a good exercise for me, I used the hotshot
> profiler to remove 30-40% of the runtime, but I was still far shy of the
> much-speedier string splitting algorithm.  But again, this application had
> *very* clean input data, with a straightforward format.  He also had a very
> demanding runtime performance criterion, having to read and process about
> 50,000 data records at startup - string.split() took about 0.08 seconds,
> pyparsing took about 5 seconds.  My recommendation was to *not* use
> pyparsing in this case.
> 
> On the other hand, for simple one-off's, or for functions that are not time-
> critical parts of a program, or if it doesn't matter if the program takes 10
> minutes to write and 30 seconds to run (with say, pyparsing) vs. 15 minutes
> to write and 0.5 seconds to run (with say, regexp's), I'd say pyparsing was
> a good choice.  And when you find you need to add or extend a given
> parsing construct, it is usually a very straightforward process with
> pyparsing.
> 
> I've had a number of e-mails telling me how pleasant and intuitive it is to
> work with pyparsing, in some ways reminiscent of the "I like coding in
> Python, even if it is slower than C at runtime" comments we read in c.l.py
> every week (along with many expositions on how raw runtime performance
> is not always the best indicator of what solution is the "best").
> 
> Just as David Mertz describes in his Text Processing in Python book,
> each of these is just one of many tools in our toolkits.  Don't get more
> complicated in your solution than you need to be.  The person who, in 6
> months, needs to try to figure out just how the heck your code works,
> just might be you.
> 
> Sorry for the length of response, hope some of you are still awake...
> 
> -- Paul
> 
> 
> 
>


