splitting words with brackets

Tim Chase python.list at tim.thechases.com
Thu Jul 27 23:30:01 EDT 2006


 >> Hunh!  I thought pyparsing was included with Debian.
 >> (http://packages.debian.org/stable/source/pyparsing)

Yes, it's available.  Laziness is the main factor
here...however, it's simply an "apt-get install pyparsing"
away.

 >> And is downloading a package really such a hardship?
 >> What, are you on dialup?

As a matter of fact, at home, yes.  As the saying goes, the
cobbler's family has no shoes (and the geek's family has no
broadband :)  For $6.50/month, when most of what I do is
email, downloading comics, and a little web-surfing for geek
news, it does the job.  Local broadband runs 3-5 times that
price around here for shoddy, intermittent service.

 >> complications just fell away, and a much simpler approach
 >> emerged - giving rise to an unavoidable "Why didn't you
 >> just say so in the first place?" experience.

Heh, if they could have concisely expressed what they
wanted, they would likely have been sufficiently competent
to write the regexp/parser in the first place.  :)

 >> It is difficult to come up with an re to compare with
 >> pyparsing's makeHTMLTags("A"), which handles:

The biggest difficulty with REs that I've found is that it's
next to impossible to work with balanced items (such as
parens or brackets) nested to an arbitrary depth.  Pyparsing
makes this almost a trivial no-brainer.
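
For instance, just to sketch the idea (this assumes
pyparsing's nestedExpr helper, so treat it as illustrative
rather than gospel):

from pyparsing import nestedExpr

# a single regexp can't count nesting depth, but nestedExpr
# handles arbitrarily nested delimiters out of the box
parens = nestedExpr('(', ')')
print parens.parseString('(a (b (c d)) e)').asList()
# -> [['a', ['b', ['c', 'd']], 'e']]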

 >> - case insensitive tag matching ("A" or "a" not so big a
 >> deal, but some tags have more than one letter)
 >> - any attributes, in any order
 >> - attribute values in single quotes, double quotes, or no
 >> quotes
 >> - attribute values are returned as named fields,
 >> converting attrib names to lower case
 >> - optional embedded '/' signifying a combined opening and
 >> closing tag, that is "<AA foo='bar'/>" is the same as
 >> "<AA foo='bar'></AA>" (more common in XML than HTML,
 >> really)
 >> - whitespace just about anywhere
 >>
 >> Hey, if you could make up an re factory function to do
 >> all that for any given tagname, you could post that to
 >> the Python Cookbook and be a real hero!
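
For anybody following along, that pyparsing call in use
looks roughly like this (a sketch from memory, so the usual
caveats apply):

from pyparsing import makeHTMLTags

# makeHTMLTags returns (open-tag, close-tag) expressions
aOpen, aClose = makeHTMLTags("A")

html = '<a HREF="http://example.com" TARGET=_blank>hi</a>'
for tokens, start, end in aOpen.scanString(html):
    # attribute names come back lowercased, as named fields
    print tokens.href, tokens.target
# -> http://example.com _blank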

I think the biggest difficulty is pathological cases such as
bad older HTML, where something like

     <a title="This is > that">

causes grief because of the unescaped entity (in xhtml/xml
it should be written <a title="This is &gt; that">).
However, if you've got a sample-set of problematic text to
try and parse, I'd gladly take a whack at creating a regexp
to parse it:

import re
examples = [
     'x1 < a href = blah src="foo" bar=\'baz\'> x2 </a> x3',
     'stuff1 <xhtml:img src="blah.gif"/> stuff2',
     """stuff1 <xhtml:img
                src="blah.gif"
                /> stuff2""",
     ]
r = re.compile(r"""
     <
     \s*
     # the tag name
     ([a-zA-Z_][-a-zA-Z0-9._]*
         # with optional namespace stuff
         (?: :[a-zA-Z_][-a-zA-Z0-9._]*)?)
     \s*
     (
         (?:
             # the attribute
             (?:[a-zA-Z_][-a-zA-Z0-9._]*
                 # with optional namespace stuff
                 (?: :[a-zA-Z_][-a-zA-Z0-9._]*)?)
             \s*
             =
             \s*
             (?:   # the attribute value
                     "[^"]*"|
                     '[^']*'|
                     [-.a-zA-Z0-9_/]+)
         \s*?
         )*
     )
     \s*
     /?
     \s*
     >
""", re.VERBOSE)

for test in examples:
     print repr(r.findall(test))
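
If all goes well, findall() should spit out one (tagname,
attribute-string) tuple per match, something like:

     [('a', 'href = blah src="foo" bar=\'baz\'')]
     [('xhtml:img', 'src="blah.gif"')]
     [('xhtml:img', 'src="blah.gif"')]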


I'm sure there are characters that the W3C defines as
allowable for tag names and the like that are skipped over
here, but the regexp above should be fairly well commented.

However, if you find some test cases that choke it, send 'em
my way and I'd be glad to massage the above to incorporate
them.

 > And thanks for responding, I really do appreciate
 > discussing his stuff - I get better ideas of what I'm
 > doing, and on occasion why I'm doing it.

As they say, the better you understand it, the better you
can defend your decisions to use it.  When all you have is a
hammer (regexps or pyparsing alone), all the world looks
like a nail.  I wasn't convinced about pyparsing for a
while, as regexps were readily available and the default for
me (being a vim/sed user, they're part of everyday life).
That lasted until I came across several classes of problem
for which regexps grew big and unwieldy quite quickly, but
which a lex/yacc/parsing sort of solution solved
elegantly--and more importantly, in an understandable
fashion.

 >>>> Ah, it's exactly what I want!  I thought the left and
 >>>> right sides of "|" are equal, but it is not true.
 >>>
 >>> In theory, they *should* be equal.  I was baffled by the
 >>> nonparity of the situation.  You *should* be able to
 >>> swap the two sides of the "|" and have it treated the
 >>> same.  Yet, when I tried it with the above regexp,
 >>> putting the \S first, it seemed to choke and give
 >>> different results.  I'd love to know why.
 >
 > Does the re do left-to-right matching?  If so, then the \S
 > will eat the opening parens/brackets, and never get into
 > the other alternative patterns.  \S is the most
 > "matchable" pattern, so if it comes ahead of the other
 > alternatives, then it will always be the one matched.  My
 > guess is that if you put \S first, you will only get the
 > contiguous character groups, regardless of ()'s and []'s.
 > The expression might as well just be \S+.

My understanding was that, with REs, it should try for the
longest match, even if it means backtracking to previously
possible patterns.  That's how it seems to work in Vim's
regexps.  Same with sed's.  I'm not sure why Python's REs
don't behave the same way. :(
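
A quick demonstration of the ordered-alternation behavior
Paul describes (and of why order matters when splitting
words with brackets -- just a sketch of the effect, not the
original regexp):

import re

text = 'abc [x y] def'

# \S+ is tried first, so it gobbles the brackets piecemeal:
print re.findall(r'\S+|\[[^]]*\]', text)
# -> ['abc', '[x', 'y]', 'def']

# with the bracket pattern first, whole groups survive:
print re.findall(r'\[[^]]*\]|\S+', text)
# -> ['abc', '[x y]', 'def']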

 >>-tkc
 >
 > -- Paul   I put *two* dashes in front of my sig. :)

And a space!  A regular John Hancock of extravagance there.
:)  Laziness has me using a single dash with no space, and
using "tkc" rather than "tim" as there are already far too
many Tims on the list.

-tkc
