[Tutor] can I walk or glob a website?

Wed May 18 11:51:35 CEST 2011

Alan Gauld wrote:
>
> "Albert-Jan Roskam" <fomcl at yahoo.com> wrote
>> How can I walk (as in os.walk) or glob a website?
>
> I don't think there is a way to do that via the web.
> Of course if you have access to the web server's filesystem you can use
> os.walk to do it as for any other filesystem, but I don't think it's
> generally possible over HTTP. (And indeed it shouldn't be for very good
> security reasons!)
>
> OTOH I've been wrong before! :-)
>

It has to be (more or less) possible.  That's what Google's crawler does
to build its search index.

Three broad issues.

1) Are you violating the terms of service of the web site?  Will you be
doing this seldom enough that the bandwidth you use won't amount to a
DoS attack?  Is the material you plan to download copyrighted?  Is the
website protected by a login, by cookies, or by a VPN?  Does the website
present a different view to different browsers, different operating
systems, or different target domains?

2) Websites vary enormously in their adherence to standards.  There are
many such standards, and browsers tend to be very tolerant of bugs in
a site, bugs which will be painful for you to accommodate.  Some
extensions/features, such as Flash, are very hard to parse.  Others,
such as JavaScript, build parts of the page at run time, so a static
parse of the HTML can miss them.

3) How important is it to do it reliably?  Your code may work perfectly 
with a particular website, and next week they'll make a change which 
breaks your code entirely.  Are you willing to rework the code each time 
that happens?

Many sites have APIs that you can use to access their data.  Sometimes
that is a better answer than scraping the pages.

With all of that said, I'll point you to Beautiful Soup, a library
that will parse a page of moderately correct HTML and give you its
elements as a tree.  If it's a static page, you can then walk that tree
and find all the content that interests you.  You can also find all the
web pages that the first one links to, and recurse on those.
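
As a rough sketch of the link-finding step, here is a version that uses
the standard library's html.parser instead of Beautiful Soup, so it runs
without installing anything.  The names LinkCollector and extract_links
are made up for illustration, and it only looks at <a href="..."> tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us lowercased tag and attribute names.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in one page of HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    page = '<p><a href="/about">About</a> <a href="news.html">News</a></p>'
    for link in extract_links(page, "http://example.com/index.html"):
        print(link)
```

Each URL it returns is a candidate page to fetch and parse in turn.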

Notice that you need to limit your scope, since many websites have
direct and indirect links to most of the web.  For example, you might
only recurse into links that refer to the same domain.  For many
websites, that means you won't get everything.  So you may want to
supply a list of domains and/or subdomains that you're willing to
recurse into.
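
Something like the following could serve as a starting point for such a
scoped walk.  It is only a sketch under my own assumptions: crawl,
fetch_page, allowed_hosts, max_pages and delay are names I invented, it
uses the standard library's html.parser rather than Beautiful Soup, and
it ignores logins, cookies and anything JavaScript generates.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _Links(HTMLParser):
    """Gather the raw href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def fetch_page(url):
    """Default fetcher: download a page and decode it leniently."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, allowed_hosts, fetch=fetch_page, max_pages=50, delay=1.0):
    """Breadth-first walk that only follows links into allowed_hosts."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        parser = _Links()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            parts = urlparse(link)
            if parts.scheme not in ("http", "https"):
                continue  # ignore mailto:, ftp:, javascript: and so on
            if parts.hostname in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
        if delay:
            time.sleep(delay)  # pause between requests to spare the server
    return visited


if __name__ == "__main__":
    for page_url in crawl("http://example.com/", {"example.com"}, max_pages=5):
        print(page_url)
```

The seen set keeps the walk from looping on pages that link back to each
other, and max_pages is a crude safety net in case the allowed domains
turn out to be larger than expected.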

See   http://pypi.python.org/pypi/BeautifulSoup/3.2.0

DaveA