Matching XML Tag Contents with Regex
Chris
chrisspen at gmail.com
Tue Dec 11 11:05:29 EST 2007
I'm trying to find the contents of an XML tag. Nothing fancy. I don't
care about parsing child tags or anything. I just want to get the raw
text. Here's my script:
import re
data = """
<?xml version='1.0'?>
<body>
<div class='default'>
here's some text!
</div>
<div class='default'>
here's some text!
</div>
<div class='default'>
here's some text!
</div>
</body>
"""
tagName = 'div'
pattern = re.compile('<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*[^(%(tagName)s)]*' % dict(tagName=tagName))
matches = pattern.finditer(data)
for m in matches:
    contents = data[m.start():m.end()]
    print repr(contents)
    assert tagName not in contents
The problem I'm running into is that the [^(%(tagName)s)]* portion of my
regex seems to be ignored: only one match is returned, starting at the
first <div> and running to the end of the text, when each match should
end at the first </div>. For this example, it should return three
matches, one for each div.
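
A minimal sketch of what appears to be happening, assuming the usual
semantics of Python's re module. The names char_class, catch_all and sample
are made up for illustration. A negated character class such as [^(div)]
excludes the single characters (, d, i, v and ), not the word "div", and a
greedy class that accepts every character (for example one containing both
\d and \D) runs to the end of the input before the trailing class is tried.

```python
import re

# A negated character class excludes single characters, not a whole word:
# this stops at the first 'd', 'i', 'v', '(' or ')'.
char_class = re.compile(r"[^(div)]*")
print(repr(char_class.match("abc div").group()))

# [\d\D] accepts any character, so the greedy * consumes the rest of the
# input, including every later </div> and <div ...>.
catch_all = re.compile(r"<div\s[^>]*>[\d\D]*")
sample = "<div class='a'>one</div><div class='a'>two</div>"
matches = [m.group() for m in catch_all.finditer(sample)]
print(len(matches))
```

With the greedy class already at the end of the string, the trailing
negated class can only match zero characters, which would explain the
single match.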
Is what I'm trying to do possible with Python's re module? Is
there an error in my regex?
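
For reference, one possible sketch of the behaviour described above, not a
definitive answer. It assumes a non-greedy quantifier with re.DOTALL is
acceptable and that tags of the same name are never nested. The helper name
tag_contents and the shortened sample text are hypothetical.

```python
import re

def tag_contents(tag_name, text):
    """Return the raw text between each <tag_name ...> and the next </tag_name>."""
    name = re.escape(tag_name)
    # .*? is non-greedy, so each match stops at the first closing tag;
    # re.DOTALL lets '.' match newlines as well.
    pattern = re.compile(r"<%s\b[^>]*>(.*?)</%s\s*>" % (name, name), re.DOTALL)
    return [m.group(1) for m in pattern.finditer(text)]

sample = """
<body>
<div class='default'>
first
</div>
<div class='default'>
second
</div>
</body>
"""

for contents in tag_contents("div", sample):
    print(repr(contents))
```

This treats the document as plain text, so it would break on nested tags of
the same name, comments, or CDATA sections; an XML parser is the safer
choice when those can occur.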
Thanks,
Chris