webspider, regexp not working, why?

alex23 wuwei23 at gmail.com
Fri May 23 22:39:19 EDT 2008


On May 24, 3:26 am, "Reedick, Andrew" <jr9... at ATT.COM> wrote:
> c)  If you're going to parse html/xml then bite the bullet and learn one
> of the libraries specifically designed to parse html/xml.  Many other
> regex gurus have learned this lesson.  Myself included.  =)

Agreed. The BeautifulSoup approach is particularly nice (although it's
not part of the stdlib):

>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> html = urllib.urlopen('http://www.python.org/').read()
>>> soup = BeautifulSoup(html)
>>> links = [link['href'] for link in soup('link')]
>>> links[0]
u'http://www.python.org/channews.rdf'
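
If installing a third-party package isn't an option, roughly the same
thing can be sketched with the stdlib's html.parser. This is only a
minimal sketch, assuming Python 3; the LinkCollector class and the
sample HTML are made up for illustration and are not from the session
above:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <link> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "link":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# Made-up sample document standing in for a fetched page.
SAMPLE_HTML = """
<html><head>
<link rel="alternate" type="application/rss+xml" href="http://example.com/feed.rdf">
<link rel="stylesheet" href="/styles/main.css">
</head><body><a href="/about">About</a></body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
links = collector.hrefs
print(links[0])
```

Swapping "link" for "a" in handle_starttag would pick up ordinary
hyperlinks instead, which is usually what a spider is after.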

- alex23



