Scraping a web page

Support Desk support.desk.ipg at gmail.com
Tue Apr 7 11:02:15 EDT 2009


If you're only interested in the images, perhaps you want to use wget like:

 

wget -r --accept=jpg,jpeg www.xyz.org

 

or maybe this

 

http://www.vex.net/~x/python_stuff.html

 

BackCrawler <http://www.vex.net/%7Ex/files/backcrawler.zip>  1.1

A crude web spider with only one purpose: mercilessly suck the background
images from all web pages it can find. Understands frames and redirects,
uses MD5 to eliminate duplicates. Need web page backgrounds? This'll get lots
of them. Sadly, most are very tacky, and Backcrawler can't help with that.
Requires Threads.
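
If you'd rather stay in Python than shell out to wget, something along these
lines might be a starting point. It is only a rough sketch and not how
BackCrawler itself is written: the names ImageCollector, find_images and
unique_by_md5 are made up for illustration. It pulls image references out of
a page's HTML and uses MD5 digests to drop byte-identical downloads.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute URLs from <img src=...> and background= attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <img src=...> is an ordinary image; the legacy background=
        # attribute (on <body>, <table>, <td>) names a background image.
        ref = attrs.get("src") if tag == "img" else attrs.get("background")
        if ref:
            # Resolve relative references against the page's own URL.
            self.urls.append(urljoin(self.base_url, ref))


def find_images(html, base_url):
    """Return image URLs in document order (hypothetical helper)."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.urls


def unique_by_md5(blobs):
    """Drop byte-identical blobs, keeping the first copy of each."""
    seen = set()
    unique = []
    for data in blobs:
        digest = hashlib.md5(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(data)
    return unique
```

Feed find_images() the string you got back from urlopen().read() (decoded to
text), fetch each URL it returns, and pass the downloaded bytes through
unique_by_md5() before saving them.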

 

 

  _____  

From: Ronn Ross [mailto:ronn.ross at gmail.com] 
Sent: Tuesday, April 07, 2009 9:37 AM
To: Support Desk
Subject: Re: Scraping a web page

 

This works great, but is there a way to do this with firefox or something
similar so I can also print the images from the site? 

On Tue, Apr 7, 2009 at 9:58 AM, Support Desk <support.desk.ipg at gmail.com>
wrote:

You could do something like below to get the rendered page.

import os
site = 'website.com'
# Run lynx and read its text dump of the rendered page, one list item per line
X = os.popen('lynx --dump %s' % site).readlines()








-----Original Message-----
From: Tim Chase [mailto:python.list at tim.thechases.com]
Sent: Tuesday, April 07, 2009 7:45 AM
To: Ronn Ross
Cc: python-list at python.org
Subject: Re: Scraping a web page

> f = urllib.urlopen("http://www.google.com")
> s = f.read()
>
> It is working, but it's returning the source of the page. Is there any way I
> can get almost a screen capture of the page?

This is the job of a browser -- to render the source HTML.  As
such, you'd want to look into any of the browser-automation
libraries to hook into IE, Firefox, Opera, or maybe using the
WebKit/KHTML control.  You may then be able to direct it to
render the HTML into a canvas you can then treat as an image.
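
As one possible illustration of driving a browser from Python, the sketch
below shells out to Firefox in headless mode and asks it to write a
screenshot. It assumes a Firefox build recent enough to support the
--headless and --screenshot options (much newer than anything discussed in
this thread) and that the firefox binary is on the PATH. The helper names
screenshot_command and capture are invented for the example.

```python
import subprocess


def screenshot_command(url, out_path, browser="firefox"):
    """Build the argument list for a headless-browser screenshot.

    Assumption: the browser accepts --headless and --screenshot <file>.
    """
    return [browser, "--headless", "--screenshot", out_path, url]


def capture(url, out_path):
    """Run the browser; raises CalledProcessError on a non-zero exit."""
    subprocess.run(screenshot_command(url, out_path), check=True)
```

Calling capture("http://www.google.com", "page.png") should leave an image of
the rendered page in page.png, provided the assumptions above hold on your
machine.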

Another alternative might be provided by some web-services that
will render a page as HTML with various browsers and then send
you the result.  However, these are usually either (1)
asynchronous or (2) paid services (or both).

-tkc








 
