Re: get wikipedia source failed (urllib2)

Michael J. Fromberger Michael.J.Fromberger at Clothing.Dartmouth.EDU
Tue Aug 7 10:18:05 EDT 2007


In article <1186476847.728759.166610 at o61g2000hsh.googlegroups.com>,
 shahargs at gmail.com wrote:

> Hi,
> I'm trying to get the Wikipedia page source with urllib2:
>     usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
>     data = usock.read()
>     usock.close()
>     return data
> I get an exception because of an HTTP 403 error. Why? With my browser I
> can access it without any problem.
> 
> Thanks,
> Shahar.

It appears that Wikipedia may inspect the contents of the User-Agent 
HTTP header, and that it does not particularly like the string it 
receives from Python's urllib.  I was able to make it work with urllib 
via the following code:

import urllib

class CustomURLopener(urllib.FancyURLopener):
    # Override the default "Python-urllib/x.y" User-Agent string
    version = 'Mozilla/5.0'

urllib._urlopener = CustomURLopener()

u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
data = u.read()

I'm assuming a similar trick could be used with urllib2, though I didn't 
actually try it.  Another thing to watch out for is that some sites 
will redirect a public URL X to an internal URL Y, and will permit 
access to Y only if the Referer field indicates the request came from 
somewhere internal to the site.  I have seen both of these techniques 
used to foil screen-scraping.

Cheers,
-M

-- 
Michael J. Fromberger             | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/  | Dartmouth College, Hanover, NH, USA


