Matching XML Tag Contents with Regex

Tue Dec 11 17:31:22 EST 2007

On Dec 11, 1:08 pm, "Diez B. Roggisch" <de... at nospam.web.de> wrote:
> Chris wrote:
> > On Dec 11, 11:41 am, garage <xmikeda... at gmail.com> wrote:
> >> > Is what I'm trying to do possible with Python's Regex library? Is
> >> > there an error in my Regex?
>
> >> Search for '*?' on http://docs.python.org/lib/re-syntax.html.
>
> >> To get around the greedy single match, you can add a question mark
> >> after the asterisk in the 'content' portion of the markup.  This
> >> causes it to take the shortest match instead of the longest, e.g.
>
> >> <%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?[^(%(tagName)s)]*
>
> >> There's still some funkiness in the regex and logic, but this gives
> >> you the three matches
>
> > Thanks, that's pretty close to what I was looking for. How would I
> > filter out tags that don't have certain text in the contents? I'm
> > running into the same issue again. For instance, if I use the regex:
>
> > <%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?(targettext)+[^(%(tagName)s)]*
>
> > each match will include "targettext". However, some matches will still
> > include </%(tagName)s)>, presumably from the tags which didn't contain
> > targettext.
>
> Stop using the wrong tool for the job. Use lxml or BeautifulSoup to parse &
> access HTML.
>
> Diez

I was hoping a simple pattern like <tag>.*text.*</tag> wouldn't be too
complicated for a regex, but now I'm starting to agree with you. Parsing
the entire XML DOM would probably be a lot easier.
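
For anyone following along, here is a rough sketch of the difference, not a
definitive implementation. It assumes the input is well-formed XML and uses
the standard library's xml.etree.ElementTree rather than lxml or
BeautifulSoup, purely to avoid a third-party install. The sample document,
the "item" tag name, and both helper functions are made up for
illustration. The regex is a cleaned-up, non-greedy form of
<tag>.*text.*</tag>, not the exact pattern quoted above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample data: only items 1 and 3 contain the target text.
SAMPLE = """<root>
<item id="1">first targettext here</item>
<item id="2">nothing relevant</item>
<item id="3">another targettext</item>
</root>"""


def regex_matches(xml_text, tag, text):
    """Non-greedy regex attempt: <tag ...> ... text ... </tag>."""
    pattern = r"<%(tag)s\s[^>]*>.*?%(text)s.*?</%(tag)s>" % {
        "tag": re.escape(tag),
        "text": re.escape(text),
    }
    # DOTALL lets '.' cross newlines, as the original character class did.
    return re.findall(pattern, xml_text, re.DOTALL)


def elements_containing(xml_text, tag, text):
    """Parse the document, then keep elements whose text contains `text`."""
    root = ET.fromstring(xml_text)
    return [el for el in root.iter(tag) if text in "".join(el.itertext())]


if __name__ == "__main__":
    for match in regex_matches(SAMPLE, "item", "targettext"):
        print(repr(match))
    for el in elements_containing(SAMPLE, "item", "targettext"):
        print(el.get("id"), el.text)
```

Even with the non-greedy '.*?', the regex starts a match at item 2 and keeps
scanning past its closing tag until it finds "targettext" inside item 3, so
the second match swallows two elements. That looks like the same symptom
described above. The parser version checks each element separately, so only
items 1 and 3 come back.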