Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

John Nagle nagle at animats.com
Tue Oct 2 11:11:28 EDT 2007


Filip Salomonsson wrote:
> On 02/10/2007, John Nagle <nagle at animats.com> wrote:
>> But there's something in there now that robotparser doesn't like.
>> Any ideas?
> 
> Wikipedia denies _all_ access for the standard urllib user agent, and
> when the robotparser gets a 401 or 403 response when trying to fetch
> robots.txt, it is equivalent to "Disallow: *".
> 
> http://infix.se/2006/05/17/robotparser

     That explains it.  It's an undocumented feature of "robotparser",
as is the 'errcode' attribute.  The "robotparser" documentation is
silent on error handling (can it raise an exception?) and should be
updated.
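
     For reference, here is a minimal sketch of the behavior Filip
describes, using the Python 2.x "robotparser" module of this era (the
Wikipedia URLs are just examples): after a 401/403 on robots.txt, the
parser records the HTTP status in the undocumented 'errcode' attribute
and answers False to every can_fetch() query.

    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://en.wikipedia.org/robots.txt")
    rp.read()    # Wikipedia answers 401/403 to urllib's default user agent

    # Undocumented: the HTTP status of the robots.txt fetch.
    print rp.errcode

    # A 401/403 is treated as "disallow everything", so this prints False.
    print rp.can_fetch("*", "http://en.wikipedia.org/wiki/Main_Page")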

> It could also be worth mentioning that if you were planning on
> crawling a lot of Wikipedia pages, you may be better off downloading
> the whole thing instead: <http://download.wikimedia.org/>
> (perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
> wiki markup to HTML).

     This is for SiteTruth, the site rating system (see "sitetruth.com"),
and we never look at more than 21 pages per site.  We're looking for
the name and address of the business behind the web site, and if we
can't find that after checking the 20 most obvious places, it's
either not there or not "prominently disclosed".

				John Nagle
