Problem with Python's "robots.txt" file parser in module robotparser
Nikita the Spider
NikitaTheSpider at gmail.com
Thu Jul 12 14:46:00 EDT 2007
In article <IEcli.26514$C96.1279 at newssvr23.news.prodigy.net>,
John Nagle <nagle at animats.com> wrote:
> Nikita the Spider wrote:
>
> >
> > Hi John,
> > Are you sure you're not confusing your sites? The robots.txt file at
> > www.ibm.com contains the double slashed path. The robots.txt file at
> > ibm.com is different and contains this which would explain why you
> > think all URLs are denied:
> > User-agent: *
> > Disallow: /
> >
> Ah, that's it. The problem is that "ibm.com" redirects to
> "http://www.ibm.com", but "ibm.com/robots.txt" does not
> redirect. For comparison, try "microsoft.com/robots.txt",
> which does redirect.
Strange thing for them to do, isn't it? Especially with two such
different robots.txt files.
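For anyone following along, the "all URLs are denied" behavior is easy to reproduce offline with the standard library's parser (the module discussed here; in modern Python 3 it lives at urllib.robotparser rather than the old top-level robotparser). This is just a sketch feeding the rules Nikita quoted from ibm.com directly into the parser, with no network fetch involved:

```python
from urllib import robotparser

# The rules reportedly served at ibm.com (without "www"),
# which deny every path to every user agent.
rules = [
    "User-agent: *",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)  # parse() accepts an iterable of lines

# With "Disallow: /", can_fetch() is False for any URL,
# which matches the "all URLs are denied" symptom.
print(rp.can_fetch("MyCrawler/1.0", "http://ibm.com/some/page"))
```

Note that parse() sidesteps the redirect question entirely; if you use set_url()/read() instead, the URL you point it at (ibm.com vs. www.ibm.com) determines which of the two different robots.txt files you actually get.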
--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more