Matching XML Tag Contents with Regex

Chris chrisspen at gmail.com
Tue Dec 11 11:05:29 EST 2007


I'm trying to find the contents of an XML tag. Nothing fancy. I don't
care about parsing child tags or anything. I just want to get the raw
text. Here's my script:

import re

data = """
<?xml version='1.0'?>
<body>
<div class='default'>
here's some text!
</div>
<div class='default'>
here's some text!
</div>
<div class='default'>
here's some text!
</div>
</body>
"""

tagName = 'div'
pattern = re.compile(
    '<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*[^(%(tagName)s)]*'
    % dict(tagName=tagName))

matches = pattern.finditer(data)
for m in matches:
    contents = data[m.start():m.end()]
    print repr(contents)
    assert tagName not in contents

The problem I'm running into is that the [^(%(tagName)s)]* portion of my
regex seems to be ignored, so only one match is returned, starting
at the first <div> and ending at the end of the text, when it should
end at the first </div>. For this example, it should return three
matches, one for each div.

Is what I'm trying to do possible with Python's re module? Is
there an error in my regex?

Thanks,
Chris
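
A minimal sketch of one way to get the three matches described above,
assuming the target tags are never nested. It relies on two standard re
features: a non-greedy .*? and the re.DOTALL flag, so that . also matches
newlines. A character class such as [^(div)] excludes the individual
characters (, d, i, v and ) rather than the string "div", and a class
containing both \d and \D matches any character, so the original pattern
greedily runs to the end of the text. The helper name tag_contents is
hypothetical and not part of the original script.

```python
import re

data = """
<?xml version='1.0'?>
<body>
<div class='default'>
here's some text!
</div>
<div class='default'>
here's some text!
</div>
<div class='default'>
here's some text!
</div>
</body>
"""


def tag_contents(tag_name, text):
    """Return the raw text between each <tag_name ...> and </tag_name>.

    Hypothetical helper: assumes the tag is never nested inside itself.
    """
    pattern = re.compile(
        # Opening tag with optional attributes, then a lazy capture that
        # stops at the first matching closing tag.
        r'<%(tag)s(?:\s[^>]*)?>(.*?)</%(tag)s\s*>'
        % dict(tag=re.escape(tag_name)),
        re.DOTALL)  # let "." match newlines as well
    return [m.group(1) for m in pattern.finditer(text)]


contents = tag_contents('div', data)
for c in contents:
    print(repr(c))
```

For nested or otherwise irregular markup, a real XML parser such as the
standard library's xml.etree.ElementTree is likely a safer choice than a
regex.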


