[Tutor] Web crawling!
vince spicer
vinces1979 at gmail.com
Wed Jul 29 20:53:32 CEST 2009
On Wed, Jul 29, 2009 at 9:59 AM, Raj Medhekar <cosmicsand27 at yahoo.com>wrote:
> Does anyone know a good webcrawler that could be used in tandem with the
> Beautiful soup parser to parse out specific elements from news sites like
> BBC and CNN? Thanks!
> -Raj
>
>
> _______________________________________________
> Tutor maillist - Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
>
>
I have used httplib2 (http://code.google.com/p/httplib2/) to crawl sites (with
auth/cookies) and lxml (HTML + XPath) to parse out the links.
But you could use the built-in urllib2 to request pages if no auth/cookie
support is required. Here is a simple example:
import urllib2
from lxml import html

# Fetch the page and parse it into an element tree
page = urllib2.urlopen("http://this.page.com")
data = html.fromstring(page.read())

all_links = data.xpath("//a")  # all <a> elements on the page
for link in all_links:
    print link.attrib["href"]
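If you don't have lxml installed, link extraction alone can also be sketched with
nothing but the standard library's HTML parser (Python 3 syntax shown below; the
`LinkCollector` class name and the sample HTML are made up for this example):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag seen while parsing."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

sample = ('<html><body>'
          '<a href="http://www.bbc.co.uk/">BBC</a> '
          '<a href="http://www.cnn.com/">CNN</a>'
          '</body></html>')
parser = LinkCollector()
parser.feed(sample)
print(parser.links)
```

This only gathers hrefs, of course; lxml's XPath gives you far more flexibility
(attribute filters, text content, nested elements) once you need it.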