[Tutor] can I walk or glob a website?

Steven D'Aprano steve at pearwood.info
Wed May 18 13:13:17 CEST 2011


On Wed, 18 May 2011 07:06:07 pm Albert-Jan Roskam wrote:
> Hello,
>
> How can I walk (as in os.walk) or glob a website? 

If you're on Linux, use wget or curl.

If you're on Mac, you can probably install them using MacPorts.

If you're on Windows, you have my sympathies.

*wink*


> I want to download 
> all the pdfs from a website (using urllib.urlretrieve), 

This first part essentially duplicates wget or curl. The basic 
algorithm is:

- download a web page
- analyze that page for links 
  (such as <a href=...> but possibly also others)
- decide whether you should follow each link and download that page
- repeat until there's nothing left to download, the website blocks 
  your IP address, or you've got everything you want

except wget and curl already do 90% of the work.
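
Here is a minimal sketch of that algorithm in Python 2 (matching the 
urllib.urlretrieve mentioned in the question). START_URL is a 
hypothetical starting page, and a real crawler would also need the 
politeness measures discussed below:

import urllib
import urllib2
import urlparse
from HTMLParser import HTMLParser

START_URL = 'http://www.example.com/'  # hypothetical site

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

seen = set()
queue = [START_URL]
while queue:
    url = queue.pop()
    if url in seen:
        continue
    seen.add(url)
    if url.lower().endswith('.pdf'):
        # save the PDF under its basename and don't parse it as HTML
        urllib.urlretrieve(url, url.rsplit('/', 1)[-1])
        continue
    try:
        html = urllib2.urlopen(url).read()
    except Exception:
        continue  # unreachable page: skip it
    parser = LinkCollector()
    parser.feed(html)
    for link in parser.links:
        absolute = urlparse.urljoin(url, link)
        # only follow links that stay on the same site
        if absolute.startswith(START_URL):
            queue.append(absolute)

Note that the stdlib HTMLParser used here expects reasonably 
well-formed markup, which leads to the next point.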

If the webpage requires Javascript to make things work, wget or curl 
can't help. I believe the Python library Mechanize can help with 
sites driven by forms and cookies, although it does not execute 
Javascript either. For dealing with real-world HTML (also known 
as "broken" or "completely f***ed" HTML, please excuse the 
self-censorship), the library BeautifulSoup may be useful.
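
For instance, here is a minimal sketch using BeautifulSoup 3 (the 
version current as of this thread); the HTML fragment is deliberately 
malformed to show that it copes where stricter parsers choke:

from BeautifulSoup import BeautifulSoup

# deliberately broken HTML: unclosed tags everywhere
html = '<p>Report list <a href="a.pdf">first <a href="b.pdf">second'

soup = BeautifulSoup(html)
for anchor in soup.findAll('a', href=True):
    print anchor['href']   # prints a.pdf, then b.pdf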

Before doing any mass downloading, please read this:

http://lethain.com/an-introduction-to-compassionate-screenscraping/
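
Whatever else compassionate scraping involves, honouring robots.txt 
and rate-limiting your requests are the usual starting points. A 
minimal sketch using the Python 2 standard library's robotparser 
module (the site and paths are hypothetical):

import time
import robotparser

BASE = 'http://www.example.com'  # hypothetical site

# ask the site's robots.txt whether we may fetch a given URL
rp = robotparser.RobotFileParser()
rp.set_url(BASE + '/robots.txt')
rp.read()

for path in ['/reports/a.pdf', '/reports/b.pdf']:
    url = BASE + path
    if rp.can_fetch('*', url):
        print 'allowed:', url
        # ... download here ...
    time.sleep(1)  # be gentle: at most one request per second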



> extract 
> certain figures (using pypdf- is this flexible enough?) and make some
> statistics/graphs from those figures (using rpy and R). I forgot what
> the process of 'automatically downloading' is called again, something
> that sounds like 'whacking' (??)

Sometimes called screen scraping or web scraping, recursive 
downloading, or copyright infringement *wink*

http://en.wikipedia.org/wiki/Web_scraping
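
As for the pypdf part of the quoted question: whether pyPdf is 
flexible enough depends on the PDFs themselves, since text extraction 
only works when the PDF actually stores text rather than scanned 
images. A minimal extraction sketch with pyPdf (the 2011-era package; 
'report.pdf' is a hypothetical file):

from pyPdf import PdfFileReader

reader = PdfFileReader(open('report.pdf', 'rb'))
for i in range(reader.getNumPages()):
    # extractText() returns whatever text the page stream exposes;
    # results vary a lot between PDF producers
    print reader.getPage(i).extractText()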



-- 
Steven D'Aprano

