Matching XML Tag Contents with Regex

Tue Dec 11 12:33:20 EST 2007

On Dec 11, 4:05 pm, Chris <chriss... at gmail.com> wrote:
> I'm trying to find the contents of an XML tag. Nothing fancy. I don't
> care about parsing child tags or anything. I just want to get the raw
> text. Here's my script:
>
> import re
>
> data = """
> <?xml version='1.0'?>
> <body>
> <div class='default'>
> here's some text!
> </div>
> <div class='default'>
> here's some text!
> </div>
> <div class='default'>
> here's some text!
> </div>
> </body>
> """
>
> tagName = 'div'
> pattern = re.compile('<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*[^(%(tagName)s)]*' % dict(tagName=tagName))
>
> matches = pattern.finditer(data)
> for m in matches:
>     contents = data[m.start():m.end()]
>     print repr(contents)
>     assert tagName not in contents
>
> The problem I'm running into is that the [^%(tagName)s]* portion of my
> regex is being ignored, so only one match is being returned, starting
> at the first <div> and ending at the end of the text, when it should
> end at the first </div>. For this example, it should return three
> matches, one for each div.
>
> Is what I'm trying to do possible with Python's Regex library? Is
> there an error in my Regex?
>
> Thanks,
> Chris
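
A likely reason the pattern returns a single match: inside [...] every character stands alone, so [^(div)] excludes the individual characters '(', 'd', 'i', 'v' and ')' rather than the string "div". The earlier class [.\n\r\w\s\d\D\S\W]* already matches any character and is greedy, so it runs to the end of the input first. A small sketch of both effects; the names tag_name, negated, greedy and sample are only illustrative, and print is written with parentheses so it runs on Python 2 or 3:

```python
import re

tag_name = 'div'

# A character class is a set of single characters, not a substring.
# [^(div)] means "any one character except '(', 'd', 'i', 'v' or ')'".
negated = re.compile('[^(%s)]' % tag_name)
print(negated.match('d'))          # None, because 'd' is excluded
print(negated.match('x').group())  # 'x'

# [\s\S]* (like the longer class in the original pattern) matches any
# character, and * is greedy, so one match swallows everything up to
# the end of the input.
greedy = re.compile(r'<%s\s[^>]*>[\s\S]*' % tag_name)
sample = "<div class='a'>one</div><div class='b'>two</div>"
greedy_matches = greedy.findall(sample)
print(len(greedy_matches))         # 1
```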

print re.findall(r'<%s(?=[\s/>])[^>]*>' % 'div', data)

["<div class='default'>", "<div class='default'>", "<div class='default'>"]
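
That pattern matches the opening tags only. To get the text between the tags, one option is a non-greedy group followed by the closing tag. This is a rough sketch, not a general XML solution: it assumes the tag is never nested inside itself and is never self-closing, and the names tag_name and contents are only illustrative. The data is shortened to two divs, and print uses parentheses so it runs on Python 2 or 3:

```python
import re

data = """<?xml version='1.0'?>
<body>
<div class='default'>
here's some text!
</div>
<div class='default'>
here's some text!
</div>
</body>
"""

tag_name = 'div'

# (.*?) is non-greedy, so it stops at the first closing tag it can reach.
# re.DOTALL lets '.' match newlines, which plain '.' does not.
pattern = re.compile(
    r'<%(tag)s(?=[\s/>])[^>]*>(.*?)</%(tag)s\s*>' % {'tag': re.escape(tag_name)},
    re.DOTALL,
)

# With exactly one group, findall returns the captured text of each match.
contents = pattern.findall(data)
for text in contents:
    print(repr(text))
```

If the markup can contain nested tags of the same name, a real parser such as xml.etree.ElementTree from the standard library is probably a safer choice than a regex.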

HTH

Harvey