Help extracting info from HTML source ..

Nikita the Spider NikitaTheSpider at gmail.com
Fri Jan 26 13:59:46 EST 2007


In article <1169819118.201093.267320 at h3g2000cwc.googlegroups.com>,
 "Miki" <miki.tebeka at gmail.com> wrote:

> Hello Shelton,
> 
> >   I am learning Python, and have never worked with HTML.  However, I would
> > like to write a simple script to audit my 100+ Netware servers via their web
> > portal.
> Always use the right tool. BeautifulSoup
> (http://www.crummy.com/software/BeautifulSoup/) is best for web
> scraping (IMO).
> 
> from urllib import urlopen
> from BeautifulSoup import BeautifulSoup
> 
> html = urlopen("http://www.python.org").read()
> soup = BeautifulSoup(html)
> for link in soup("a"):
> 	print link["href"], "-->", link.contents

Agreed. HTML scraping is really complicated once you get into it. It 
might be interesting to write such a library just for your own 
satisfaction, but if you want to get something done then use a module 
that's already written, like BeautifulSoup. Another module that will do 
the same job but works differently (and more simply, IMO) is HTMLData by 
Connelly Barnes:
http://oregonstate.edu/~barnesc/htmldata/
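
To give a feel for what these modules do for you, here is a rough sketch 
of the same link listing using only the standard library. It assumes 
Python 3, where the parser lives in html.parser; the names LinkCollector 
and extract_links and the sample markup are mine, made up for 
illustration, and are not part of BeautifulSoup or HTMLData. Treat it as 
a sketch, not a replacement for either module:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._in_link = False  # True while inside an open <a> element
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            # attrs is a list of (name, value) pairs; href may be missing
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_link = False


def extract_links(html):
    """Return a list of (href, text) pairs found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup; the second anchor has no href on purpose.
sample = '<p><a href="/status">Server status</a> <a name="top">Top</a></p>'
for href, text in extract_links(sample):
    print(href, "-->", text)
```

Even this much only handles tidy markup. Anchors without an href come 
back as None, and nested or unclosed tags need more bookkeeping, which 
is the kind of detail the modules above already take care of.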

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more


