Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

Filip Salomonsson filip.salomonsson at gmail.com
Tue Oct 2 10:10:02 EDT 2007


On 02/10/2007, John Nagle <nagle at animats.com> wrote:
>
> But there's something in there now that robotparser doesn't like.
> Any ideas?

Wikipedia denies _all_ access to the standard urllib user agent, and
when robotparser gets a 401 or 403 response while trying to fetch
robots.txt, it treats that as equivalent to "Disallow: *", so every
URL is reported as disallowed.

http://infix.se/2006/05/17/robotparser
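
As a rough sketch of a workaround (Python 2, as in this thread): fetch
robots.txt yourself with a different User-Agent header and hand the
lines to robotparser's parse() instead of letting it call urllib on
its own. The agent string "MyWikiBot/0.1" below is just a placeholder,
not anything Wikipedia specifically recognizes.

    import robotparser
    import urllib2

    # Fetch robots.txt with a custom User-Agent so the server doesn't
    # answer 403 (which robotparser would treat as "deny everything").
    req = urllib2.Request("http://en.wikipedia.org/robots.txt",
                          headers={"User-Agent": "MyWikiBot/0.1"})
    lines = urllib2.urlopen(req).read().splitlines()

    # Feed the fetched rules to robotparser rather than calling read(),
    # which would refetch the file with urllib's default agent.
    rp = robotparser.RobotFileParser()
    rp.set_url("http://en.wikipedia.org/robots.txt")
    rp.parse(lines)

    print rp.can_fetch("MyWikiBot/0.1", "http://en.wikipedia.org/wiki/Python")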

It could also be worth mentioning that if you're planning to crawl a
lot of Wikipedia pages, you may be better off downloading
the whole thing instead: <http://download.wikimedia.org/>
(perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
wiki markup to HTML).
-- 
filip salomonsson


