urllib behaves strangely

Duncan Booth duncan.booth at invalid.invalid
Mon Jun 12 06:47:28 EDT 2006


Gabriel Zachmann wrote:

> Here is a very simple Python script utilizing urllib:
> 
>      import urllib
>      url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
>      print url
>      print
>      file = urllib.urlopen( url )
>      mime = file.info()
>      print mime
>      print file.read()
>      print file.geturl()
> 
> 
> However, when I execute it, I get an HTML error page ("access denied").
> 
> On the one hand, the funny thing is that I can view the page
> fine in my browser, and I can download it fine using curl.
> 
> On the other hand, it must have something to do with the URL, because
> urllib works fine with any other URL I have tried ...
> 
It looks like Wikipedia checks the User-Agent header and refuses to send 
pages to clients it doesn't like. Try:

import urllib2

headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'

# Pass the dict as the headers argument; the second positional
# argument of urllib2.Request is POST data, not headers.
request = urllib2.Request(url, headers=headers)
file = urllib2.urlopen(request)
...

That (or code very like it) worked when I tried it.
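For reference, here is the original script rewritten as a minimal self-contained sketch with the User-Agent fix folded in. It assumes Python 2's urllib2, as in the snippet above; the variable names are my own.

import urllib2

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"

# Identify ourselves as Firefox so the server does not reject the request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'}

request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)

print response.info()      # MIME headers
print response.read()      # page body
print response.geturl()    # final URL after any redirects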


