Regexp

Ant antroy at gmail.com
Mon Jan 19 10:01:02 EST 2009


A 0-width positive lookahead is probably what you want here:

>>> s = """
... hdhd <a href="http://mysite.com/blah.html">Test <i>String</i> OK</
a>
...
... """
>>> p = r'href="(http://mysite.com/[^"]+)">(.*)(?=</a>)'
>>> m = re.search(p, s)
>>> m.group(1)
'http://mysite.com/blah.html'
>>> m.group(2)
'Test <i>String</i> OK'

The (?=...) bit is the lookahead, and won't consume any of the string
you are searching. I've binned the named groups for clarity.

The beautiful soup answers are a better bet though - they've already
done the hard work, and after all, you are trying to roll your own
partial HTML parser here, which will struggle with badly formed html...



More information about the Python-list mailing list