unexpected behaviour for python regexp: caret symbol almost useless?

Sun May 28 10:56:44 EDT 2006

"conan" <conanelbarbaro at gmail.com> wrote in message
news:1148815719.082410.313870 at 38g2000cwa.googlegroups.com...
> This regexp
> '<widget class=".*" id=".*">'
>
> works well with 'grep' for matching lines of the kind
> <widget class="GtkWindow" id="window1">
>
> on a XML .glade file
>

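As Peter Otten has already mentioned, this is the difference between the re
"match" and "search" methods: match only tries the pattern at the start of
the string, while search scans the whole string for it.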
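For illustration, here is a minimal sketch of that difference using your
original pattern. The indented sample line is an assumption on my part (glade
files usually indent their widget tags), not something taken from your file:

```python
import re

pattern = re.compile(r'<widget class=".*" id=".*">')

# Hypothetical indented line, as it might appear in a .glade file
sample = '    <widget class="GtkWindow" id="window1">'

# match() only tries the pattern at the very start of the string,
# so the leading whitespace makes it fail
matched = pattern.match(sample)

# search() scans forward and finds the tag wherever it starts
found = pattern.search(sample)

print(matched)
print(found.group(0))
```

This is why an explicit caret adds little with match: the pattern is already
anchored to the start of the string.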

As purely a lateral exercise, here is a pyparsing rendition of your program:

------------------------------------
from pyparsing import makeXMLTags, line

# define pyparsing patterns for begin and end XML tags
widgetStart,widgetEnd = makeXMLTags("widget")

# read the file contents
glade_file_name = 'some.glade'
gladeContents = open(glade_file_name).read()

# scan the input string for matching tags
for widget,start,end in widgetStart.scanString(gladeContents):
    print("good:", line(start, gladeContents).strip())
    print(widget["class"], widget["id"])
    print("Class: %(class)s; Id: %(id)s" % widget)
------------------------------------
Not quite an exact equivalent, since only the good lines get listed.  But do
check out some of the other capabilities.  To do this with re's, you have to
clutter up the re expression with field names, as in:

(r'<widget class="(?P<class>.*)" id="(?P<id>.*)">')
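A minimal sketch of the named-group approach with the standard re module
might look like the following. The sample line and the variable names are
made up for illustration, and I have swapped .* for [^"]* so that each group
stops at its closing quote; that is my own tweak, not part of your pattern:

```python
import re

# Named groups sit inside the quotes, so the captured values
# do not include the quote characters themselves
widget_re = re.compile(r'<widget class="(?P<class>[^"]*)" id="(?P<id>[^"]*)">')

# Hypothetical line from a .glade file
sample = '  <widget class="GtkWindow" id="window1">'

m = widget_re.search(sample)

# groupdict() returns the named groups as a plain dict
fields = m.groupdict() if m else {}

if fields:
    print("Class: %(class)s; Id: %(id)s" % fields)
```

The dict from groupdict() can be used with the same %(name)s formatting as
the pyparsing results above.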

The parsing patterns generated by makeXMLTags give dict-like and
attribute-like access to any attributes included with the tag.  If not for
the unfortunate attribute name "class" (which is a Python keyword), you
could also reference these values as widget.class and widget.id.

If you are parsing HTML, there is also a makeHTMLTags function, which creates
patterns that are less rigid about upper/lower case and other points on
which XML is strict.

-- Paul