urllib behaves strangely
Duncan Booth
duncan.booth at invalid.invalid
Mon Jun 12 06:47:28 EDT 2006
Gabriel Zachmann wrote:
> Here is a very simple Python script utilizing urllib:
>
> import urllib
> url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
> print url
> print
> file = urllib.urlopen( url )
> mime = file.info()
> print mime
> print file.read()
> print file.geturl()
>
>
> However, when I execute it, I get an HTML error ("access denied").
>
> On the one hand, the funny thing is that I can view the page
> fine in my browser, and I can download it fine using curl.
>
> On the other hand, it must have something to do with the URL, because
> urllib works fine with any other URL I have tried ...
>
It looks like Wikipedia checks the User-Agent header and refuses to send
pages to browsers it doesn't like. Try:

import urllib2

headers = {}
headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                         'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')
# Note: Request's second positional argument is POST data, so the
# headers dict has to be passed by keyword (or as the third argument).
request = urllib2.Request(url, headers=headers)
file = urllib2.urlopen(request)
...
That (or code very like it) worked when I tried it.
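For anyone reading this with a newer Python: in Python 3, urllib and urllib2 were merged into urllib.request, and the same User-Agent trick looks like the sketch below. The browser string here is just an example value, not anything special; the point is only that the header is attached to the Request object before any network traffic happens, so you can inspect it without actually fetching the page.

```python
import urllib.request

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"

# Build a Request that presents itself as an ordinary browser.
# (The exact User-Agent value is an arbitrary example.)
request = urllib.request.Request(
    url,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:109.0) "
                      "Gecko/20100101 Firefox/115.0"
    },
)

# urllib.request normalizes header names with str.capitalize(), so the
# stored key is "User-agent". No connection is made until urlopen() is called.
ua = request.get_header("User-agent")
print(ua)

# To actually fetch the page you would then do:
#   page = urllib.request.urlopen(request).read()
```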