urllib behaves strangely
John J. Lee
jjlee at reportlab.com
Mon Jun 12 18:53:56 EDT 2006
Duncan Booth <duncan.booth at invalid.invalid> writes:
> Gabriel Zachmann wrote:
>
> > Here is a very simple Python script utilizing urllib:
[...]
> > "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
> > print url
> > print
> > file = urllib.urlopen( url )
[...]
> > However, when I execute it, I get an HTML error ("access denied").
> >
> > On the one hand, the funny thing is that I can view the page
> > fine in my browser, and I can download it fine using curl.
[...]
> >
> > On the other hand, it must have something to do with the URL because
> > urllib works fine with any other URL I have tried ...
> >
> It looks like Wikipedia checks the User-Agent header and refuses to send
> pages to clients it doesn't like. Try:
[...]
If Wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it. (I don't know what Wikipedia's policies
are, though.)
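
For anyone who does have a legitimate reason to fetch such a page, here
is a minimal sketch of one polite way to go about it: send an honest,
identifying User-Agent rather than impersonating a browser, and check
the site's robots.txt rules first. It is not the code elided above. It
uses the Python 3 modules urllib.request and urllib.robotparser rather
than the Python 2 urllib.urlopen from the original script, and the
names USER_AGENT, build_request, allowed_by_robots and fetch, as well
as the contact string, are made up for illustration.

```python
import urllib.request
import urllib.robotparser

# Hypothetical identifying string (an assumption, not from the thread):
# replace it with your own script name and a way to contact you.
USER_AGENT = "example-fetcher/0.1 (contact: you@example.org)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request carrying an explicit User-Agent header.

    Building the Request performs no network I/O.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def allowed_by_robots(robots_lines, url, user_agent=USER_AGENT):
    """Check url against robots.txt rules supplied as a list of lines."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def fetch(url, user_agent=USER_AGENT):
    """Fetch url with the explicit User-Agent and return the raw bytes."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()
```

Calling fetch(url) with the URL from the original script would send the
request with the custom header instead of urllib's default one. Whether
the server accepts it is up to the site, and a passing robots.txt check
does not replace reading the site's actual usage policy.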
John