unexpected behaviour for python regexp: caret symbol almost useless?

conan conanelbarbaro at gmail.com
Sun May 28 07:28:39 EDT 2006


This regexp
'<widget class=".*" id=".*">'

works well with 'grep' for matching lines of the kind
<widget class="GtkWindow" id="window1">

on a XML .glade file

However that's not true for the re module in python, since this one
takes the regexp as if were specified this way: '^<widget class=".*"
id=".*">'

For some reason regexp on python decide to match from the start of the
line, no matter if you used or not the caret symbol '^'.

I have a hard time to note why this regexp wasn't working:
regexp = re.compile(r'<widget class=".*" id="(.*)">')

The solution was to consider spaces:
regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')

To reproduce behaviour just take a .glade file and this python script:
<code>
import re

glade_file_name = 'some.glade'

bad_regexp = re.compile(r'<widget class=".*" id="(.*)">')
good_regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')

for line in open(glade_file_name):
    if bad_regexp.match(line):
        print 'bad:', line.strip()
    if good_regexp.match(line):
        print 'good:', line.strip()
</code>

The thing is i should expected to have to put caret explicitly to tell
the regexp to match at the start of the line, something like:
r'^<widget class=".*" id="(.*)">'
however python regexp is taking care of that for me. This is not a
desired behaviour for what i know about regexp, but maybe i'm missing
something.




More information about the Python-list mailing list