Regexp

Mon Jan 19 11:48:07 EST 2009

gervaz wrote:

> On Jan 19, 4:01 pm, Ant <ant... at gmail.com> wrote:
>> A 0-width positive lookahead is probably what you want here:
>>
>> >>> import re
>> >>> s = """
>> ... hdhd <a href="http://mysite.com/blah.html">Test <i>String</i> OK</a>
>> ...
>> ... """
>> >>> p = r'href="(http://mysite.com/[^"]+)">(.*)(?=</a>)'
>> >>> m = re.search(p, s)
>> >>> m.group(1)
>> 'http://mysite.com/blah.html'
>> >>> m.group(2)
>> 'Test <i>String</i> OK'
>>
>> The (?=...) bit is the lookahead, and won't consume any of the string
>> you are searching. I've binned the named groups for clarity.
>>
>> The Beautiful Soup answers are a better bet though - they've already
>> done the hard work, and after all, you are trying to roll your own
>> partial HTML parser here, which will struggle with badly formed HTML...
> 
> Ok, thank you all, I'll take a look at Beautiful Soup, although the
> lookahead solution fits better for the little I have to do.

Little things tend to get out of hand quickly... That is why so
many people gave you that hint.
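
A rough sketch of how the regex can get out of hand, assuming a
hypothetical input line that contains two links (the sample in the
thread has only one). The greedy pattern is the one quoted above; the
non-greedy variant and the names GREEDY, LAZY and line are assumptions
made for illustration.

```python
import re

# Pattern quoted in the thread: greedy (.*) followed by a zero-width lookahead.
GREEDY = r'href="(http://mysite.com/[^"]+)">(.*)(?=</a>)'
# Non-greedy variant (an assumption, not from the thread): stop at the first </a>.
LAZY = r'href="(http://mysite.com/[^"]+)">(.*?)(?=</a>)'

# Hypothetical input with two links on a single line.
line = ('<a href="http://mysite.com/a.html">First</a> and '
        '<a href="http://mysite.com/b.html">Second</a>')

greedy = re.search(GREEDY, line)
lazy = re.search(LAZY, line)

# Greedy (.*) runs on to the last </a> on the line and swallows the second link.
print(greedy.group(2))  # First</a> and <a href="http://mysite.com/b.html">Second

# Non-greedy (.*?) stops before the first </a>.
print(lazy.group(2))    # First

# The lookahead is zero-width: the match ends just before "</a>",
# so the closing tag is still there in the unmatched remainder.
print(line[lazy.end():][:4])  # </a>
```

Even the non-greedy form only covers this one case. Links split across
lines, single-quoted attributes or extra attributes after href would
each need another tweak.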

Diez
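
For comparison, a parser-based approach might look roughly like the
sketch below. It uses the standard library's html.parser as a stand-in
for Beautiful Soup (a third-party package), so it is not the Beautiful
Soup API. The names LinkCollector and extract_links are hypothetical
helpers introduced for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> tags. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._parts = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>.
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._parts)))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = 'hdhd <a href="http://mysite.com/blah.html">Test <i>String</i> OK</a>'
print(extract_links(sample))
# [('http://mysite.com/blah.html', 'Test String OK')]
```

Unlike the regex in the thread, this returns the link text without the
inner <i> markup, and it does not care whether the tag is split across
lines.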