Re: get wikipedia source failed (urrlib2)
Michael J. Fromberger
Michael.J.Fromberger at Clothing.Dartmouth.EDU
Tue Aug 7 10:18:05 EDT 2007
In article <1186476847.728759.166610 at o61g2000hsh.googlegroups.com>,
 shahargs at gmail.com wrote:
> Hi,
> I'm trying to get wikipedia page source with urllib2:
> usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
> data = usock.read();
> usock.close();
> return data
> I got an exception because of an HTTP 403 error. Why? With my browser I can
> access it without any problem.
>
> Thanks,
> Shahar.
It appears that Wikipedia may inspect the contents of the User-Agent
HTTP header, and that it does not particularly like the string it
receives from Python's urllib. I was able to make it work with urllib
via the following code:

import urllib

# FancyURLopener sends its `version` attribute as the User-Agent header.
class CustomURLopener(urllib.FancyURLopener):
    version = 'Mozilla/5.0'

# Install the custom opener as the default used by urllib.urlopen().
urllib._urlopener = CustomURLopener()

u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
data = u.read()

I'm assuming a similar trick could be used with urllib2, though I didn't
actually try it. Another thing to watch out for is that some sites
will redirect a public URL X to an internal URL Y, and will permit
access to Y only if the Referer field indicates that the request came
from somewhere internal to the site. I have seen both of these techniques
used to foil screen-scraping.
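
For anyone reading this with a current Python: a minimal sketch of the
same idea, assuming Python 3, where urllib2's functionality lives in
urllib.request. The helper names build_request and fetch are made up
for illustration, and whether a given site accepts the 'Mozilla/5.0'
string or any particular Referer value is an assumption, not something
verified here.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0", referer=None):
    """Build a Request carrying a custom User-Agent (hypothetical helper).

    The default user_agent mirrors the string used in the urllib example;
    referer is optional and must be supplied by the caller.
    """
    headers = {"User-Agent": user_agent}
    if referer is not None:
        # Only meaningful for sites that check where the request came from.
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def fetch(url, **kwargs):
    """Open the URL with the custom headers and return the raw bytes."""
    with urllib.request.urlopen(build_request(url, **kwargs)) as response:
        return response.read()
```

Usage would be along the lines of
data = fetch('http://en.wikipedia.org/wiki/Albert_Einstein'), which
returns bytes rather than a str. Setting the headers on the Request
object keeps the change local to one call instead of replacing a
module-wide opener.
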
Cheers,
-M

-- 
Michael J. Fromberger             | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/  | Dartmouth College, Hanover, NH, USA