[Tutor] Regular Expressions

Mon Mar 1 17:31:33 EST 2004

1st post... hi! :-)

I am trying to rip the hrefs from HTML documents. I've found a regular 
expression to do this, but it's for .NET and seems to use a slightly 
different syntax from Python.

The regex is in two parts. The first part is:

href\s*=\s*\"(?<url>[^\"]*)\"

I've found that Python needs a 'P' before the group name, i.e.,

href\s*=\s*\"(?P<url>[^\"]*)\"
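For reference, here is a minimal sketch of how that first part might be used from Python's re module. The sample HTML string is made up for illustration, and the backslashes before the quotes are dropped because a raw string does not need them.

```python
import re

# Python spells named groups as (?P<name>...).
HREF_QUOTED = re.compile(r'href\s*=\s*"(?P<url>[^"]*)"')

# Hypothetical sample HTML, for illustration only.
sample = '<a href="http://example.com/a">A</a> <a href = "/b">B</a>'

# finditer yields one match object per href; group("url") is the named group.
urls = [m.group("url") for m in HREF_QUOTED.finditer(sample)]
print(urls)  # ['http://example.com/a', '/b']
```

This only covers the quoted case, which is why the second alternative in the full regex matters.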

However, the full regex is:

href\s*=\s*(?:(?:\"(?<url>[^\"]*)\")|(?<url>[^\s*] ))

This uses a 'non-capturing group' to combine the two cases (a quoted 
value or an unquoted one).  I would be very grateful if someone 
could point me in the right direction on how to achieve the same effect 
in Python, as it's complaining about the 'url' group being defined twice 
rather than merging the two into one 'url' group...
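A minimal sketch that reproduces the complaint, assuming the only change to the full regex is adding the 'P' to both groups (the variable names here are just for illustration):

```python
import re

# The full regex with (?P<url>...) used in both alternatives.
pattern = r'href\s*=\s*(?:(?:"(?P<url>[^"]*)")|(?P<url>[^\s*] ))'

try:
    re.compile(pattern)
    compiled_ok = True
except re.error as exc:
    # Python's re module refuses a pattern that defines the same
    # group name more than once, even in separate alternatives.
    compiled_ok = False
    error_message = str(exc)
    print("re.compile failed:", error_message)
```

The failure happens at compile time, before any HTML is searched.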

Thanks