urllib behaves strangely

Mon Jun 12 18:53:56 EDT 2006

Duncan Booth <duncan.booth at invalid.invalid> writes:

> Gabriel Zachmann wrote:
> 
> > Here is a very simple Python script utilizing urllib:
[...]
> > "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
> >      print url
> >      print
> >      file = urllib.urlopen( url )
[...]
> > However, when i execute it, i get an HTML error ("access denied").
> > 
> > On the one hand, the funny thing though is that i can view the page
> > fine in my browser, and i can download it fine using curl.
[...]
> > 
> > On the other hand, it must have something to do with the URL because
> > urllib works fine with any other URL i have tried ...
> > 
> It looks like Wikipedia checks the User-Agent header and refuses to send 
> pages to clients it doesn't like. Try:
[...]
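
For anyone reading this in the archive: the code from the quoted reply
is elided above, so what follows is not that code, only a minimal sketch
of the general idea of sending an explicit User-Agent header. It assumes
Python 3's urllib.request rather than the urllib.urlopen call in the
original script, and the USER_AGENT string and the helper names
build_request and fetch are made up for illustration. By default urllib
identifies itself as "Python-urllib/<version>", which some sites refuse.

```python
import urllib.request

# Hypothetical identifier, not taken from the thread. A descriptive,
# honest string with contact details is generally preferable to
# impersonating a browser.
USER_AGENT = "example-fetcher/0.1 (contact: someone@example.invalid)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT):
    """Fetch the URL and return the response body as bytes (needs network)."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()


# Building the request does not touch the network, so the header can be
# inspected before anything is sent.
req = build_request(
    "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
)
print(req.get_header("User-agent"))
```

Calling fetch(url) would then perform the actual download. Whether the
server accepts a given User-Agent string is up to the site, so treat
this as a starting point rather than a guaranteed fix.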

If Wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it.  (I don't know what Wikipedia's policies
are, though.)

John