Python and Regular Expressions

Paul McGuire ptmcg at austin.rr.com
Sun Apr 11 00:32:09 EDT 2010


On Apr 10, 8:38 pm, Paul Rubin <no.em... at nospam.invalid> wrote:
> The impression that I have (from a distance) is that Pyparsing is a good
> interface abstraction with a kludgy and slow implementation.  That the
> implementation uses regexps just goes to show how kludgy it is.  One
> hopes that someday there will be a more serious implementation, perhaps
> using llvm-py (I wonder whatever happened to that project, by the way)
> so that your parser script will compile to executable machine code on
> the fly.

I am definitely flattered that pyparsing stirs up so much interest,
and among such a distinguished group. But I have to take some umbrage
at Paul Rubin's left-handed compliment,  "Pyparsing is a good
interface abstraction with a kludgy and slow implementation,"
especially since he forms his opinions "from a distance".

I actually *did* put some thought into what I wanted in pyparsing
before designing it, and that thinking became a chapter of "Getting
Started with Pyparsing" (available here as a free online excerpt:
http://my.safaribooksonline.com/9780596514235/what_makes_pyparsing_so_special#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTk3ODA1OTY1MTQyMzUvMTYmaW1hZ2VwYWdlPTE2),
the "Zen of Pyparsing" as it were. My goals (illustrated in the short
sketch after this list) were:

- build parsers using explicit constructs (such as words, groups,
repetition, alternatives), vs. expression encoding using specialized
character sequences, as found in regexen

- easy parser construction from primitive elements to complex groups
and alternatives, using Python's operator overloading for ease of
direct implementation of parsers using ordinary Python syntax; include
mechanisms for defining recursive parser expressions

- implicit skipping of whitespace between parser elements

- results returned not just as a list of strings, but as a rich data
object, with access to parsed fields by name or by list index, taking
interfaces from both dicts and lists for natural adoption into common
Python idioms

- no separate code-generation steps, a la lex/yacc

- support for parse-time callbacks, for specialized token handling,
conversion, and/or construction of data structures

- 100% pure Python, to be runnable on any platform that supports
Python

- liberal licensing, to permit easy adoption into any user's projects
anywhere
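
Here is the short sketch I mentioned above, a made-up example (the
grammar and the sample data are invented purely for illustration) that
touches several of these goals at once: explicit constructs composed
with "+", implicit whitespace skipping, a parse-time conversion
callback, and named access to the parsed results:

    from pyparsing import Word, Suppress, alphas, nums

    # explicit constructs composed with '+' (operator overloading);
    # whitespace between elements is skipped implicitly
    integer = Word(nums)
    integer.setParseAction(lambda t: int(t[0]))  # parse-time conversion callback

    name = Word(alphas)
    entry = name("first") + name("last") + Suppress(",") + integer("age")

    result = entry.parseString("Alice Jones , 42")
    print(result.first, result.last, result.age)  # fields by name
    print(result.asList())                        # -> ['Alice', 'Jones', 42]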

So raw performance really didn't even make my short-list, beyond the
obvious "should be tolerably fast enough."

I have found myself reading posts on c.l.py with wording like "I'm
trying to parse <blah-blah> and I've been trying for hours/days to get
this regex working."  For kicks, I'd spend 5-15 minutes working up a
pyparsing solution, which *does* run comparatively slowly, perhaps
taking a few minutes to process the poster's data file.  But the net
solution is developed and running in under half an hour, which to me
seems like an overall gain compared to hours of fruitless struggling
with backslashes and regex character sequences.  On top of which, the
pyparsing solutions are still readable when I come back to them weeks
or months later, instead of staring at some line-noise regex and just
scratching my head, wondering what it was for.  And
sometimes "comparatively slowly" means that it runs 50x slower than a
compiled method that runs in 0.02 seconds - that's still getting the
job done in just 1 second.

And is the internal use of regexes within pyparsing really a "kludge"?
Why? They are almost completely hidden from the parser developer. And
yet by using compiled regexes, I retain the portability of 100% Python
while leveraging the compiled speed of the re engine.
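
For instance (a small sketch of my own, with the names invented for
illustration), expressions such as Word are matched under the hood
with compiled re patterns where possible, and pyparsing also exposes
a Regex element for the cases where a hand-written pattern really is
the right tool:

    from pyparsing import Regex, Word, alphas

    ident = Word(alphas)                  # matched internally via a compiled re
    version = Regex(r"\d+\.\d+(\.\d+)?")  # an explicit regex, wrapped as a parser element

    release = ident("name") + version("ver")
    print(release.parseString("pyparsing 1.5.2").asDict())
    # -> {'name': 'pyparsing', 'ver': '1.5.2'}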

It does seem that there have been many posts of late (either on c.l.py
or the related posts on Stackoverflow) where the OP is trying to
either scrape content from HTML, or parse some type of recursive
expression.  HTML scrapers implemented using re's are terribly
fragile, since HTML in the wild often contains little surprises
(unexpected whitespace; upper/lower case inconsistencies; tag
attributes in unpredictable order; attribute values with double,
single, or no quotation marks) which completely frustrate any re-based
approach.  Granted, there are times when an re-parsing-of-HTML
endeavor *isn't* futile or doomed from the start - the OP may be
working with a very restricted set of HTML, generated from some other
script so that the output is very consistent. Unfortunately, this
poster usually gets thrown under the same "you'll never be able to
parse HTML with re's" bus. I can't explain the surge in these posts,
other than to wonder if we aren't just seeing a skewed sample - that
is, the many cases where people *are* successfully using re's to solve
their text extraction problems aren't getting posted to c.l.py, since
no one posts questions they already have the answers to.
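
As an aside, pyparsing's makeHTMLTags helper was written with exactly
these little surprises in mind: it tolerates stray whitespace,
mixed-case tags, attributes in any order, and quoted or unquoted
attribute values. A small made-up sketch (the sample HTML is invented
for illustration):

    from pyparsing import makeHTMLTags, SkipTo

    a_open, a_close = makeHTMLTags("a")
    link = a_open + SkipTo(a_close)("body") + a_close

    html = '<A href=index.html >Home</A>  <a class="nav" href="/about">About us</a>'
    for tokens, _, _ in link.scanString(html):
        # the href attribute is available by name on the returned tokens
        print(tokens.href, tokens.body)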

So don't be too dismissive of pyparsing, Mr. Rubin. I've gotten many
e-mail, wiki, and forum posts from Python users at all levels of
expertise, saying that pyparsing has helped them to be very
productive in one or another aspect of creating a command parser, or
adding safe expression evaluation to an app, or just extracting some
specific data from a log file. I am encouraged that most report that
they can get their parsers working in reasonably short order, often by
reworking one of the examples that comes with pyparsing.  If you're
offering to write that extension to pyparsing that generates the
parser runtime in fast machine code, it sounds totally bitchin' and
I'd be happy to include it when it's ready.

-- Paul


