Matching XML Tag Contents with Regex

Diez B. Roggisch deets at nospam.web.de
Tue Dec 11 13:08:05 EST 2007


Chris wrote:

> On Dec 11, 11:41 am, garage <xmikeda... at gmail.com> wrote:
>> > Is what I'm trying to do possible with Python's Regex library? Is
>> > there an error in my Regex?
>>
>> Search for '*?' on http://docs.python.org/lib/re-syntax.html.
>>
>> To get around the greedy single match, you can add a question mark
>> after the asterisk in the 'content' portion of the markup.  This
>> causes it to take the shortest match instead of the longest, e.g.:
>>
>> <%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?[^(%(tagName)s)]*
>>
>> There's still some funkiness in the regex and logic, but this gives
>> you the three matches
> 
> Thanks, that's pretty close to what I was looking for. How would I
> filter out tags that don't have certain text in the contents? I'm
> running into the same issue again. For instance, if I use the regex:
> 
> <%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?(targettext)+[^(%(tagName)s)]*
> 
> each match will include "targettext". However, some matches will still
> include </%(tagName)s)>, presumably from the tags which didn't contain
> targettext.
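
For what it's worth, part of the surprise is probably that [^(...)] is
a character class, not a negated group: it excludes the individual
characters listed, parentheses included, rather than the tag name as a
whole. A minimal sketch of that, and of greedy versus non-greedy
matching, assuming a literal tag name "p" and a made-up sample string
(neither comes from the original post):

```python
import re

# Hypothetical sample input, for illustration only.
text = "<p class='a'>one</p><p class='b'>two</p>"

# Greedy: .* runs to the last '</p>', so both elements end up in one match.
greedy = re.findall(r"<p\s[^>]*>.*</p>", text, re.DOTALL)

# Non-greedy: .*? stops at the first '</p>' that lets the pattern succeed.
lazy = re.findall(r"<p\s[^>]*>.*?</p>", text, re.DOTALL)

# [^(p)] matches any single character other than '(', 'p' or ')'.
# It does not mean "anything that is not the tag p".
char_class = re.findall(r"[^(p)]+", "</p>")

print(greedy)
print(lazy)
print(char_class)
```

So the trailing [^(%(tagName)s)]* happily consumes '<', '/' and '>' and
only stops at a letter from the tag name or a parenthesis.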

Stop using the wrong tool for the job. Use lxml or BeautifulSoup to parse &
access HTML. 
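
A rough sketch of the parser-based approach, under some assumptions:
it uses the standard library's xml.etree.ElementTree as a stand-in
(lxml.etree offers a largely compatible API), the helper name
elements_containing and the sample document are made up for
illustration, and the input must be well-formed XML. For messy
real-world HTML, BeautifulSoup or lxml's HTML parser would be the more
forgiving choice.

```python
import xml.etree.ElementTree as ET


def elements_containing(xml_text, tag_name, target_text):
    """Return the tag_name elements whose text content includes target_text.

    Hypothetical helper, not part of any library.
    """
    root = ET.fromstring(xml_text)
    matches = []
    for element in root.iter(tag_name):
        # itertext() yields the element's own text plus that of its descendants.
        content = "".join(element.itertext())
        if target_text in content:
            matches.append(element)
    return matches


# Made-up sample document.
sample = """<root>
  <item id="1">nothing here</item>
  <item id="2">some targettext inside</item>
  <item id="3"><b>nested</b> targettext too</item>
</root>"""

found = elements_containing(sample, "item", "targettext")
ids = [element.get("id") for element in found]
print(ids)
```

The filtering happens on parsed elements, so a closing tag from a
neighbouring element can never leak into a result the way it does with
the regex.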

Diez


