[Catalog-sig] why is the wiki being hit so hard?

Laura Creighton lac at openend.se
Sun Aug 5 07:59:20 CEST 2007


In a message of Sat, 04 Aug 2007 09:42:45 +0200, "Martin v. Löwis" writes:
>> If they do not respect them, then you can use this program:
>> http://danielwebb.us/software/bot-trap/ to catch them.
>> If you are doing this, Martin, use the German version instead:
>> http://www.spider-trap.de/
>> because it has a few useful additions.  I forget what now.
>> 
>> Most scrapers, these days, respect robots.txt which will make this
>> program useless for catching them.  But some days you can get lucky.
>
>That would also be an idea. I'll see how the throttling works out;
>if it fails (either because it still gets overloaded - which shouldn't
>happen - or because legitimate users complain), I'll try that one.
>
>Regards,
>Martin

Pardon the completely useless quoting of irrelevant text,
but I tried just telling catalog-sig to go read this URL:
http://search.msn.com.my/docs/siteowner.aspx?t=SEARCH_WEBMASTER_FAQ_MSNBotIndexing.htm&FORM=WFDD#D
and check the entry "MSNBot is crawling my site too frequently".

I got a "suspicious header" rejection, which is what all the python.org
lists say when they think you are sending them spam, even though the
URL was in the body and not in the header.  So if your text is basically
a URL and you want to send it to a python.org list, you are screwed.
So I found an article to reply to instead.

Go read that.

I think it says that we could set a Crawl-delay to some number
-- why 120 I have no clue -- and the spider will be made to
behave.  Or possibly we can hack the bot trap to catch those
that do not respect Crawl-delay.

At any rate, it seems relevant to our problem.

Laura

