urlopen returns forbidden

Chris Rebert clp2 at rebertia.com
Mon Feb 28 01:19:18 EST 2011


On Sun, Feb 27, 2011 at 9:38 PM, monkeys paw <monkey at joemoney.net> wrote:
> I have a working urlopen routine which opens
> a url, parses it for <a> tags and prints out
> the links in the page. On some sites, wikipedia for
> instance, i get a
>
> HTTP error 403, forbidden.
>
> What is the difference in accessing the site through a web browser
> and opening/reading the URL with python urllib2.urlopen?

The User-Agent header (http://en.wikipedia.org/wiki/User_agent ).
"By default, the URLopener class sends a User-Agent header of
urllib/VVV, where VVV is the urllib version number."
    – http://docs.python.org/library/urllib.html

Some sites block obvious non-search-engine bots based on their HTTP
User-Agent header value.

You can override the urllib default:
http://docs.python.org/library/urllib.html#urllib.URLopener.version
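
Since you are using urllib2 rather than urllib, a rough sketch of the
same idea there is to pass a headers dict to a Request object. The
USER_AGENT string and the build_request/fetch helper names below are
made up for illustration; the try/except import is only there so the
snippet also runs on Python 3, where these names live in urllib.request.

```python
try:
    # Python 2, as in the original question
    from urllib2 import Request, urlopen
except ImportError:
    # Python 3 moved these names into urllib.request
    from urllib.request import Request, urlopen

# Hypothetical identifier; substitute something that describes your script.
USER_AGENT = "LinkLister/0.1 (contact: you@example.com)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT):
    """Open the URL with the custom User-Agent and return the raw body."""
    response = urlopen(build_request(url, user_agent))
    try:
        return response.read()
    finally:
        response.close()
```

Calling fetch(url) in place of a bare urlopen(url).read() should be
enough to get past a block that is based only on the default
User-Agent value; whether a given site accepts it is up to that site.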

Sidenote: Wikipedia has a proper API for programmatic access, which is
likely why it's blocking your program.
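
As a rough illustration of going through the API instead of scraping
the HTML, the sketch below only builds a query URL that asks the
MediaWiki API for the links on a page. The links_query_url helper name
and the default limit are my own choices, and the endpoint shown is the
English Wikipedia one; check the API documentation for the parameters
you actually need.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

# English Wikipedia's MediaWiki API endpoint (assumed here; other wikis differ).
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def links_query_url(title, limit=50):
    """Build a URL asking the API for the links on the page `title`."""
    params = [
        ("action", "query"),      # run a query
        ("prop", "links"),        # request the page's links
        ("titles", title),        # the page to inspect
        ("pllimit", str(limit)),  # maximum number of links per response
        ("format", "json"),       # reply as JSON rather than HTML
    ]
    return API_ENDPOINT + "?" + urlencode(params)
```

The resulting URL can be opened like any other, ideally with a
descriptive User-Agent, and the JSON reply parsed with the json module
rather than an HTML parser.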

Cheers,
Chris


