urllib behaves strangely

Duncan Booth duncan.booth at invalid.invalid
Tue Jun 13 03:26:47 EDT 2006


John J. Lee wrote:

>> It looks like wikipedia checks the User-Agent header and refuses to
>> send pages to browsers it doesn't like. Try:
> [...]
> 
> If wikipedia is trying to discourage this kind of scraping, it's
> probably not polite to do it.  (I don't know what wikipedia's policies
> are, though)

They have a general policy against unapproved bots, which is
understandable since badly behaved bots could mess up or delete pages.
If you read the policy, though, it is aimed at bots which modify
Wikipedia articles automatically.

http://en.wikipedia.org/wiki/Wikipedia:Bots says:
> This policy in a nutshell:
> Programs that update pages automatically in a useful and harmless way
> may be welcome if their owners seek approval first and go to great
> lengths to stop them running amok or being a drain on resources.

On the other hand, something which simply retrieves one or two fixed
pages doesn't fit that definition of a bot, so it is probably all right.
They even provide a link to some frameworks for writing bots, e.g.

http://sourceforge.net/projects/pywikipediabot/
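
Incidentally, for the simple fetch-one-page case, the User-Agent
override suggested above would look something like this with urllib2
(the agent string and URL below are just made-up examples, not anything
Wikipedia has blessed):

import urllib2

# Wikipedia refuses the default Python-urllib agent, so supply a
# descriptive one of our own instead.
req = urllib2.Request(
    'http://en.wikipedia.org/wiki/Special:Export/Python',
    headers={'User-Agent': 'MyPageFetcher/0.1 (someone@example.org)'})

page = urllib2.urlopen(req).read()

With plain urllib you can get the same effect by subclassing
FancyURLopener and setting its 'version' attribute before calling
urlopen.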



