[Tutor] Web crawling!

vince spicer vinces1979 at gmail.com
Wed Jul 29 20:53:32 CEST 2009


On Wed, Jul 29, 2009 at 9:59 AM, Raj Medhekar <cosmicsand27 at yahoo.com> wrote:

> Does anyone know a good webcrawler that could be used in tandem with the
> Beautiful soup parser to parse out specific elements from news sites like
> BBC and CNN? Thanks!
> -Raj

I have used httplib2 (http://code.google.com/p/httplib2/) to crawl sites (with
auth/cookie support) and lxml (HTML XPath) to parse out links.
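For the auth/cookie case, here is a rough sketch of how that combination can
look. The URL and credentials are placeholders, not a real site; note that
httplib2 does not manage cookies automatically, so you carry them between
requests yourself:

import httplib2
from lxml import html

h = httplib2.Http(".cache")              # ".cache" dir enables simple response caching
h.add_credentials("user", "password")    # basic/digest auth, if the site requires it

response, content = h.request("http://example.com/news")

# cookies come back in the response headers; send them on the next request by hand
cookie = response.get("set-cookie")
headers = {"Cookie": cookie} if cookie else {}
response, content = h.request("http://example.com/news/page2", headers=headers)

data = html.fromstring(content)
for link in data.xpath("//a[@href]"):
    print link.get("href")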

But you could use the built-in urllib2 to request pages if no auth/cookie
support is required. Here is a simple example:

import urllib2
from lxml import html

page = urllib2.urlopen("http://this.page.com")
data = html.fromstring(page.read())

all_links = data.xpath("//a[@href]") # all links on the page that carry an href

for link in all_links:
    print link.attrib["href"]
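One caveat: the href values you get back are often relative ("/news/story.html"
rather than a full URL). A small follow-up sketch, reusing all_links from above
(the base URL is just a placeholder for whatever page you fetched), that makes
them absolute with the standard library's urlparse:

import urlparse

base = "http://this.page.com"  # placeholder: the URL the page was fetched from
for link in all_links:
    # urljoin resolves a relative href against the page's own URL
    print urlparse.urljoin(base, link.attrib["href"])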