[Catalog-sig] why is the wiki being hit so hard?

Laura Creighton lac at openend.se
Sat Aug 4 09:24:15 CEST 2007


One possibility is that we are being scraped.  Some jerk comes along,
copies all your web content, and runs his own mirror so that he
can get ad revenue from AdSense.  One thing to check is whether the
spider respects robots.txt.
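
One way to check that from the server side, sketched below: list the
paths you do not want crawled in robots.txt, then scan the access log
for user agents that fetched disallowed paths anyway.  This is only a
sketch; it assumes an Apache combined log format, and the log path and
site URL (access.log, example.org) are placeholders.

    import re
    from urllib import robotparser

    LOG_FILE = "access.log"                       # placeholder path
    ROBOTS_URL = "http://example.org/robots.txt"  # placeholder URL

    rp = robotparser.RobotFileParser(ROBOTS_URL)
    rp.read()  # fetch and parse the live robots.txt

    # Apache combined log: IP ... "GET /path HTTP/1.1" ... "ref" "agent"
    # (GET requests only, which is all a crawler normally sends)
    line_re = re.compile(r'^(\S+) .*? "GET (\S+) [^"]*" .*? "([^"]*)"$')

    offenders = set()
    with open(LOG_FILE) as log:
        for line in log:
            m = line_re.match(line)
            if not m:
                continue
            ip, path, agent = m.groups()
            # If robots.txt disallows this path for this agent, the
            # crawler fetched something it was told to stay out of.
            if not rp.can_fetch(agent, path):
                offenders.add((ip, agent))

    for ip, agent in sorted(offenders):
        print(ip, agent)

Anything this prints repeatedly is a crawler that ignored robots.txt
and crawled the disallowed paths anyway.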

If they do not respect it, then you can use this program:
http://danielwebb.us/software/bot-trap/ to catch them.
If you are doing this, Martin, use the German version instead:
http://www.spider-trap.de/
because it has a few useful additions; I forget which ones now.
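
Roughly, the trick these traps use: robots.txt declares some path
off-limits, a link to it is hidden in the pages where no human would
click it, and any client that follows the link anyway gets its address
written to a blocklist.  A minimal sketch of that idea, with
placeholder paths (not the actual bot-trap or spider-trap code).  The
robots.txt entry:

    User-agent: *
    Disallow: /trap/

and a CGI script sitting at the disallowed URL:

    #!/usr/bin/env python
    # Sits at the trap URL, e.g. /trap/index.cgi.  Well-behaved
    # crawlers never reach it; anything that does is recorded.

    import os

    BLOCKLIST = "/var/www/blocked_ips.txt"  # placeholder path

    def main():
        ip = os.environ.get("REMOTE_ADDR", "unknown")
        with open(BLOCKLIST, "a") as f:
            f.write(ip + "\n")
        print("Content-Type: text/plain")
        print()
        print("This URL is disallowed by robots.txt.")

    if __name__ == "__main__":
        main()

The web server then has to consult the blocklist (a deny rule,
.htaccess, or a firewall script) to actually shut the scraper out;
the programs linked above handle that part for you.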

Most scrapers, these days, respect robots.txt, which makes this
program useless for catching them.  But some days you get lucky.

I think the only real fix for this is for Google and the other search
engines to set up a service where people whose web content has been
scraped and rehosted can report the rehosting sites and have Google
rank them as the millionth result or so.  That is, this is a political
and economic problem, not a technical one.

Laura

