Saving search results in a dictionary - why is pyparsing better than regexp?
Jean Brouwers
mrjean1 at comcast.net
Fri Jun 18 19:41:28 EDT 2004
If you need a fast parser in Python, try SimpleParse (mxTextTools). We
use it to parse and process large log files, 100+MB in size.
<http://members.rogers.com/mcfletch/programming/simpleparse/simpleparse.html>
As an example, the run time for the parsing step alone, with a simple but
non-trivial grammar, is comparable to grep. Total run time is dominated
by the processing step and increases for more complex grammars,
obviously.
/Jean Brouwers
ProphICy Semiconductor, Inc.
In article <dsJAc.1740$1L2.1400 at fe1.texas.rr.com>, Paul McGuire
<ptmcg at austin.rr._bogus_.com> wrote:
> "Lukas Holcik" <xholcik1 at fi.muni.cz> wrote in message
> news:Pine.LNX.4.60.0406181448190.19022 at nymfe30.fi.muni.cz...
> > Hi Paul and thanks for reply,
> >
> > Why is the pyparsing module better than re? Just a question I must ask
> > before I can use it. Meant with no offense. I found an extra pdf howto on
> > python.org about regexps and found out, that there is an object called
> > finditer, which could accomplish this task quite easily:
> >
> > regexp = re.compile("<a href=\"(?P<href>.*?)\">(?P<pcdata>.*?)</a>",
> >                     re.I)
> > iterator = regexp.finditer(text)
> > for match in iterator:
> >     dict[match.group("pcdata")] = match.group("href")
> >
> > ---------------------------------------_.)--
> > | Lukas Holcik (xholcik1 at fi.muni.cz) (\=)*
> > ----------------------------------------''--
> >
> <snip>
>
> Lukas -
>
> A reasonable question, no offense taken. :)
>
> Well, I'm not sure I'd say pyparsing was "better" than re - maybe "easier"
> or "easier to read" or "easier to maintain" or "easier for those who don't
> do regexp's frequently enough to have all the re symbols memorized". And
> I'd be the first to admit that pyparsing is slow at runtime. I would also
> tell you that I am far from being a regexp expert, having had to delve into
> them only 3 or 4 times in the past 10 years (and consequently re-learn them
> each time).
>
> On the other hand, pyparsing does do some things to simplify your life. For
> instance, there are a number of valid HTML anchors that the re you listed
> above would miss. First of all, whitespace has a funny way of cropping up
> in unexpected places, such as between 'href' and '=', or between '=' and the
> leading ", or in the closing /a tag as "< /a >". What often starts out as a
> fairly clean-looking regexp such as you posted quickly becomes mangled with
> markers for optional whitespace. (Although I guess there *is* a magic
> re tag to indicate that whitespace between tokens may or may not be
> there...)
>
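To make the whitespace point above concrete, here is a minimal sketch using only the standard re module. The sample text and the names `anchor` and `links` are made up for illustration. Note that re.VERBOSE only allows whitespace and comments inside the pattern itself; optional whitespace in the input still has to be spelled out with \s* markers.

```python
import re

# Hypothetical sample input with the spacing variations described above.
text = ('<a href = "http://example.com/a">First</a> '
        '<A HREF="http://example.com/b">Second< /a >')

# re.VERBOSE ignores whitespace and comments in the pattern only;
# whitespace in the input is still matched explicitly with \s*.
anchor = re.compile(r'''
    <\s*a\s+href\s*=\s*      # opening tag, spaces allowed around the equals sign
    "(?P<href>[^"]*)"\s*>    # quoted URL, then the end of the opening tag
    (?P<pcdata>.*?)          # link text, non-greedy
    <\s*/\s*a\s*>            # closing tag, tolerating spaces inside it
''', re.IGNORECASE | re.VERBOSE)

# Map link text to URL, as in the original finditer loop.
links = {m.group("pcdata"): m.group("href") for m in anchor.finditer(text)}
print(links)
```

Even in this small case the pattern is mostly whitespace markers, which is the readability cost being described.
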
> Comments are another element that can confound well-intentioned regexp
> efforts. The pyparsing example that I gave earlier did not handle HTML
> comments, but to do so, you would define an HTML comment element, and
> then add the statement:
> link.ignore( htmlComment )
> (pyparsing includes a comment definition for C-style block comments of
> the /* */ variety - maybe adding an HTML comment definition would be
> useful?) What would the above regexp look like to handle embedded HTML
> comments?
>
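One possible answer to the comment question, sketched with the re module alone: strip the comments in a separate pass before looking for anchors. This is an assumption about what "handle" should mean here (links inside comments are skipped), the sample text is invented, and the simple `<!-- ... -->` pattern does not cover every corner of real-world HTML.

```python
import re

# Hypothetical input: one live link and one commented-out link.
text = '''<a href="http://example.com/live">Live</a>
<!-- <a href="http://example.com/dead">Dead</a> -->'''

# re.DOTALL lets '.' cross newlines, so multi-line comments are removed too.
comment = re.compile(r'<!--.*?-->', re.DOTALL)
anchor = re.compile(r'<a href="(?P<href>.*?)">(?P<pcdata>.*?)</a>', re.I)

# First pass removes comments, second pass collects the remaining anchors.
uncommented = comment.sub("", text)
links = {m.group("pcdata"): m.group("href") for m in anchor.finditer(uncommented)}
print(links)
```
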
> In the sample I posted earlier, extracting URL refs from www.yahoo.com, a
> number of href's were *not* inside quoted strings - how quickly could the
> above regexp be modified to handle this?
>
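For the unquoted-href case, a rough sketch of how the regexp might be extended, again with invented sample text and group names. It assumes an unquoted value runs up to the next whitespace or '>' character, which is a simplification.

```python
import re

# Hypothetical input: one unquoted and one quoted href.
text = ('<a href=http://example.com/bare>Bare</a> '
        '<a href="http://example.com/quoted">Quoted</a>')

# Two alternatives for the attribute value: a double-quoted string,
# or a bare run of characters up to whitespace or '>'.
anchor = re.compile(
    r'<a\s+href\s*=\s*(?:"(?P<quoted>[^"]*)"|(?P<bare>[^\s>]+))\s*>'
    r'(?P<pcdata>.*?)</a>',
    re.I,
)

links = {}
for m in anchor.finditer(text):
    # Exactly one of the two groups matched; pick whichever it was.
    href = m.group("quoted") if m.group("quoted") is not None else m.group("bare")
    links[m.group("pcdata")] = href
print(links)
```
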
> Doesn't the .* only match non-white characters? Does the above regexp
> handle hrefs that are quoted strings with embedded spaces? What about
> pcdata with embedded spaces? (Sorry if my re ignorance is showing here.)
>
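On the `.*` question above: in Python's re module, `.` matches any character except a newline, so embedded spaces in the href or the pcdata are matched. What the posted regexp does miss is link text that spans lines, unless re.DOTALL is added. A small check, with made-up sample strings:

```python
import re

anchor = re.compile(r'<a href="(?P<href>.*?)">(?P<pcdata>.*?)</a>', re.I)

# '.' matches spaces, so both groups may contain them.
m = anchor.search('<a href="my page.html">Click here now</a>')
spaces = (m.group("href"), m.group("pcdata"))

# A newline inside the link text defeats the match without re.DOTALL.
multiline = '<a href="x.html">two\nlines</a>'
without_dotall = anchor.search(multiline)
with_dotall = re.search(anchor.pattern, multiline, re.I | re.DOTALL)

print(spaces)
print(without_dotall)
print(with_dotall.group("pcdata"))
```
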
> Lastly, pyparsing does handle some problems that regexp's can't, most
> notably those that have some recursive definition, such as algebraic infix
> notation, or EBNF. Working samples of both of these are included in the
> samples that come with pyparsing. (There are other parsers out there
> besides pyparsing, too, that can do this same job.)
>
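To show what "recursive definition" means for infix notation, here is a small hand-written recursive-descent evaluator in plain Python 3. It is a stand-in for illustration, not the sample shipped with pyparsing, and all function names are invented. The key point is that an atom may contain a whole parenthesized expression, to any depth, which a single regexp cannot express.

```python
import re

# A token is an integer or one of the operator/parenthesis characters.
TOKEN = re.compile(r'\s*(\d+|[-+*/()])')

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_expr(tokens):
    # expr := term (('+' | '-') term)*
    value = parse_term(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op = tokens.pop(0)
        rhs = parse_term(tokens)
        value = value + rhs if op == "+" else value - rhs
    return value

def parse_term(tokens):
    # term := atom (('*' | '/') atom)*
    value = parse_atom(tokens)
    while tokens and tokens[0] in ("*", "/"):
        op = tokens.pop(0)
        rhs = parse_atom(tokens)
        value = value * rhs if op == "*" else value / rhs
    return value

def parse_atom(tokens):
    # atom := NUMBER | '(' expr ')'   (this is the recursive step)
    if not tokens:
        raise ValueError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        value = parse_expr(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise ValueError("missing closing parenthesis")
        return value
    if tok.isdigit():
        return int(tok)
    raise ValueError("unexpected token %r" % tok)

def evaluate(text):
    tokens = tokenize(text)
    value = parse_expr(tokens)
    if tokens:
        raise ValueError("unexpected token %r" % tokens[0])
    return value

print(evaluate("2 * (3 + (4 - 1)) * 2"))
```

The regexp here only splits the input into tokens; the nesting is handled by the functions calling each other.
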
> pyparsing's runtime performance is pretty slow, positively glacial compared
> to compiled regexp's or string splits. I've warned away some potential
> pyparsing users who had *very* clean input data (no hand-edited input text,
> very stable and simple input format) that used string split() to run 50X
> faster than pyparsing. This was a good exercise for me, I used the hotshot
> profiler to remove 30-40% of the runtime, but I was still far shy of the
> much-speedier string splitting algorithm. But again, this application had
> *very* clean input data, with a straightforward format. He also had a very
> demanding runtime performance criterion, having to read and process about
> 50,000 data records at startup - string.split() took about 0.08 seconds,
> pyparsing took about 5 seconds. My recommendation was to *not* use
> pyparsing in this case.
>
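For anyone wanting to repeat the split() side of that comparison on their own data, a rough timing harness might look like the following. The record format is invented (the original application's format is not given in the post), and timings will vary by machine, so no figures are claimed here.

```python
import timeit

# Hypothetical clean input: 50,000 comma-separated records.
records = ["%d,item%d,%.2f" % (i, i, i * 0.5) for i in range(50000)]

def parse_all(lines):
    # Split each record into its fields; works only because the input is clean.
    return [line.split(",") for line in lines]

rows = parse_all(records)

# Time a single pass over all records.
elapsed = timeit.timeit(lambda: parse_all(records), number=1)
print("split() over %d records: %.3f s" % (len(rows), elapsed))
```
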
> On the other hand, for simple one-off's, or for functions that are not time-
> critical parts of a program, or if it doesn't matter if the program takes 10
> minutes to write and 30 seconds to run (with say, pyparsing) vs. 15 minutes
> to write and 0.5 seconds to run (with say, regexp's), I'd say pyparsing was
> a good choice. And when you find you need to add or extend a given
> parsing construct, it is usually a very straightforward process with
> pyparsing.
>
> I've had a number of e-mails telling me how pleasant and intuitive it is to
> work with pyparsing, in some ways reminiscent of the "I like coding in
> Python, even if it is slower than C at runtime" comments we read in c.l.py
> every week (along with many expositions on how raw runtime performance
> is not always the best indicator of what solution is the "best").
>
> Just as David Mertz describes in his Text Processing in Python book,
> each of these is just one of many tools in our toolkits. Don't get more
> complicated in your solution than you need to be. The person who, in 6
> months, needs to try to figure out just how the heck your code works,
> just might be you.
>
> Sorry for the length of response, hope some of you are still awake...
>
> -- Paul
>
>
>
>