[Tutor] can I walk or glob a website?

Hugo Arts hugo.yoshi at gmail.com
Wed May 18 19:56:06 CEST 2011


On Wed, May 18, 2011 at 7:32 PM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
>
> ===> Thanks for your reply. I tried wget, which seems to be a very handy
> tool. However, it doesn't work on this particular site. I tried wget -e
> robots=off -r -nc --no-parent -l6 -A.pdf
> 'http://www.landelijkregisterkinderopvang.nl/' (the quotes are there because
> I originally used a deeper link that contains ampersands). I also tested it
> on python.org, where it does work. Adding -e robots=off didn't work either.
> Do you think this could be protection put in place by the administrator?
>

wget works by recursively following hyperlinks from the page you
supply. The page you entered leads to a search form (which wget
wouldn't know how to fill out) but nothing else, so wget has nothing
to follow and never reaches any of the PDF documents.
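
For the record, this is essentially all that wget -r does: fetch a
page, pull out the <a href> targets, and repeat on each of those. A
rough sketch of that step in Python (using python.org, since that's
the site where it worked for you):

    import urllib2
    from HTMLParser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collect the href of every <a> tag on a page."""
        def __init__(self):
            HTMLParser.__init__(self)  # old-style class, no super()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                self.links.extend(v for k, v in attrs if k == 'href')

    page = urllib2.urlopen('http://www.python.org/').read()
    collector = LinkCollector()
    collector.feed(page)
    print collector.links

Run the same thing against the landelijkregisterkinderopvang page and
you'll see the problem: no links to PDFs come back, so the recursion
has nowhere to go.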

I think your best approach is the brute-force ID generation you
mentioned earlier. Be polite about it: wait a few seconds after four
consecutive failed attempts, download only one PDF at a time, pause a
second or two after each download, that kind of thing. Just don't
flood the server.
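
If the PDF URLs really do contain a sequential ID, something along
these lines would do it. The URL pattern and ID range here are made
up -- substitute the real ones, taken from a PDF link you know works:

    import time
    import urllib2

    # Hypothetical URL pattern and ID range -- replace with the real
    # ones from a working PDF link.
    PATTERN = 'http://www.landelijkregisterkinderopvang.nl/pdf/%d.pdf'

    failures = 0
    for reg_id in xrange(100000, 200000):
        url = PATTERN % reg_id
        try:
            data = urllib2.urlopen(url).read()  # one download at a time
        except urllib2.HTTPError as e:
            if e.code != 404:
                raise
            failures += 1
            if failures >= 4:       # several misses in a row: back off
                time.sleep(5)
                failures = 0
            continue
        failures = 0
        with open('%d.pdf' % reg_id, 'wb') as out:
            out.write(data)
        time.sleep(2)               # pause between downloads

The sleeps are the important part: a run of 404s means you're probably
off the valid ID range, so slow down rather than hammer the server.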

