Re: get wikipedia source failed (urllib2)

Michael J. Fromberger Michael.J.Fromberger at Clothing.Dartmouth.EDU
Tue Aug 7 10:18:05 EDT 2007


In article <1186476847.728759.166610 at o61g2000hsh.googlegroups.com>,
 shahargs at gmail.com wrote:

> Hi,
> I'm trying to get the Wikipedia page source with urllib2:
>     usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
>     data = usock.read()
>     usock.close()
>     return data
> I get an exception because of an HTTP 403 error. Why? With my browser I
> can access it without any problem.
> 
> Thanks,
> Shahar.

It appears that Wikipedia may inspect the contents of the User-Agent 
HTTP header, and that it does not particularly like the string it 
receives from Python's urllib.  I was able to make it work with urllib 
via the following code:

import urllib

class CustomURLopener(urllib.FancyURLopener):
    # Override the default "Python-urllib/x.y" User-Agent string
    version = 'Mozilla/5.0'

urllib._urlopener = CustomURLopener()

u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
data = u.read()

I'm assuming a similar trick could be used with urllib2, though I didn't 
actually try it.  Another thing to watch out for is that some sites 
will redirect a public URL X to an internal URL Y, and will permit 
access to Y only if the Referer field indicates the request came from 
somewhere internal to the site.  I have seen both of these techniques 
used to foil screen-scraping.

Cheers,
-M

-- 
Michael J. Fromberger             | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/  | Dartmouth College, Hanover, NH, USA


