Regexp

MRAB google at mrabarnett.plus.com
Mon Jan 19 09:21:40 EST 2009


gervaz wrote:
> Hi all, I need to find all the addresses in an HTML source page. I'm
> using:
> 'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</b>)?</a>'
> but the [^</a>]+ pattern retrieves all the strings not containing <
> or / or a etc., although I only want to exclude the word "</a>". How can I
> specify: 'do not match the string "blabla"'?
> 
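A character class such as [^</a>] excludes each of the listed characters
individually, not the string "</a>". If the name is always followed by "<",
just match the name with [^<]+: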

href="(?P<url>http://mysite\.com/[^"]+)">(<b>)?(?P<name>[^<]+)(</b>)?</a>
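
For what it's worth, here is a minimal sketch of how that pattern might be used from Python. The sample HTML string and the names LINK_RE, html and links are made up for illustration; only the pattern itself comes from this message. It assumes each link looks like the ones described, i.e. the name contains no tags other than an optional <b>.

```python
import re

# The pattern from this message, joined onto one line. A raw string keeps
# the backslash in \. intact for the regex engine.
LINK_RE = re.compile(
    r'href="(?P<url>http://mysite\.com/[^"]+)">(<b>)?(?P<name>[^<]+)(</b>)?</a>'
)

# Made-up sample HTML, purely for illustration.
html = (
    '<a href="http://mysite.com/page1">First page</a> '
    '<a href="http://mysite.com/page2"><b>Second page</b></a>'
)

# Collect (url, name) pairs using the named groups.
links = [(m.group("url"), m.group("name")) for m in LINK_RE.finditer(html)]
print(links)
```

For anything much messier than this, a real HTML parser is probably a safer bet than a regex.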

I've also changed mysite.com to mysite\.com because an unescaped . matches
any character (except a newline, by default), whereas what you probably want
to match is a literal ".".
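
A quick way to see the difference between . and \. is a small test like the following sketch. The look-alike host "mysiteXcom" is invented for the demonstration and is not from the original question.

```python
import re

# Hypothetical look-alike host, made up for this demonstration.
text = 'href="http://mysiteXcom/page"'

# Unescaped dot: "." matches the "X", so this finds a match.
unescaped = re.search(r'http://mysite.com/', text)

# Escaped dot: only a literal "." is accepted, so this finds nothing.
escaped = re.search(r'http://mysite\.com/', text)

print(unescaped is not None)
print(escaped is not None)
```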


