Regexp

Diez B. Roggisch deets at nospam.web.de
Mon Jan 19 09:50:16 EST 2009


gervaz wrote:

> Hi all, I need to find all the addresses in an HTML source page. I'm
> using:
> 'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</
> b>)?</a>'
> but the [^</a>]+ pattern retrieves all the strings not containing <
> or / or a etc., although I just want to exclude the word "</a>". How can I
> specify: 'do not match the string "blabla"'?

You should consider using BeautifulSoup or lxml's error-tolerant HTML parser
to work with HTML documents.

Sooner or later your regex-based processing is bound to fail, as documents
get more complicated. Better to use the right tool for the job.

The code should look like this (untested):

from BeautifulSoup import BeautifulSoup
html = """<html><a href="http://mysite.com/foobar/baz">link</a></html>"""

res = []
soup = BeautifulSoup(html)
for tag in soup.findAll("a", href=True):  # skip anchors that have no href
    if tag["href"].startswith("http://mysite.com"):
        res.append(tag["href"])
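
If installing a third-party parser is not an option, roughly the same idea
can be sketched with the standard library's html.parser module. This is only
an illustration: the LinkCollector class and its prefix argument are made-up
names, it also keeps the link text (the "name" group from the original
pattern), and it is less forgiving of badly broken markup than BeautifulSoup
or lxml.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect (url, text) pairs for matching <a> tags."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the currently open matching <a>, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href (or with an empty one) are ignored.
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Nested tags such as <b> are skipped; only their text is kept.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


html = (
    '<html><a href="http://mysite.com/foobar/baz"><b>link</b></a>'
    '<a name="x">anchor</a><a href="http://other.org/">no</a></html>'
)
collector = LinkCollector("http://mysite.com/")
collector.feed(html)
collector.close()
links = collector.links
```

Here links should contain only the mysite.com entry, with the text taken
from inside the <b> element.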


Not so hard, and *much* more robust.
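
As for the literal question: a character class such as [^</a>] excludes the
single characters <, / and a, never the word "</a>". To say "anything that is
not the string </a>", a negative lookahead does the job. The following is a
minimal sketch that keeps the assumptions of the original pattern (double
quotes, href directly followed by ">", an optional <b> wrapper); LINK_RE is
just an illustrative name, and the approach stays as fragile as any other
regex run against HTML.

```python
import re

# (?:(?!</a>|</b>).)+ consumes one character at a time, but only while the
# text at that position does not start with the literal "</a>" or "</b>".
LINK_RE = re.compile(
    r'href="(?P<url>http://mysite\.com/[^"]+)">'
    r'(?:<b>)?(?P<name>(?:(?!</a>|</b>).)+)(?:</b>)?</a>',
    re.DOTALL,
)

html = '<a href="http://mysite.com/foobar/baz"><b>a/b test</b></a>'
matches = [(m.group("url"), m.group("name")) for m in LINK_RE.finditer(html)]
```

Note that the link text may now contain the characters "a" and "/", which
the character-class version would have rejected.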

Diez


