Problem when fetching page using urllib2.urlopen

Piet van Oostrum piet at cs.uu.nl
Mon Aug 10 15:36:55 EDT 2009


>>>>> jitu <nair.jitendra at gmail.com> (j) wrote:

>j> Hi,
>j> An HTML page contains 'anchor' elements whose 'href' attribute has
>j> a semicolon in the URL. When fetching the page using
>j> urllib2.urlopen, all such hrefs containing semicolons are
>j> truncated.


>j> For example the href http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL
>j> gets truncated to http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i

>j> The page I am talking about can be fetched from
>j> http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_ylc=X3oDMTFka28zOGNuBF9TAzI3NjY2NzkEX3MDOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--

It's not Python that causes this. The server itself sends you the
URLs without these path parameters (that is what the part after the
semicolon is).
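
As an illustration of why the part after the semicolon counts as
"parameters", here is a minimal sketch of how the standard library
splits such a URL. It assumes Python 3's urllib.parse, falling back to
the Python 2 urlparse module; the variable names are only illustrative.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

url = ('http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i'
       ';_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL')

parts = urlparse(url)

# The text after ';' in the last path segment is reported separately
# as the "params" component, not as part of the path.
print(parts.path)
print(parts.params)
```

The path printed here matches the truncated href from the question,
and the params component holds the piece that went missing.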

To get them you have to tell the server that you are a respectable
browser by sending browser-like request headers, e.g.:

import urllib2

# The example href from the question.
url = 'http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL'

# The page that contains it. This second assignment replaces the first,
# so this is the URL that actually gets fetched.
url = 'http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_ylc=X3oDMTFka28zOGNuBF9TAzI3NjY2NzkEX3MDOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--'

# Headers that make the request look like it comes from a browser.
hdrs = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13',
        'Accept': 'image/*'}

request = urllib2.Request(url=url, headers=hdrs)
page = urllib2.urlopen(request).read()
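
To check both halves of this without touching the network, something
like the following sketch may help. It confirms that the request object
keeps the semicolon part of the URL intact, and it lists the hrefs in a
piece of HTML that still carry parameters. HrefCollector,
hrefs_with_params, and the sample HTML are made up for illustration; the
imports assume Python 3 with a fallback to the Python 2 module names.

```python
try:
    from urllib.request import Request  # Python 3
    from html.parser import HTMLParser
except ImportError:
    from urllib2 import Request  # Python 2
    from HTMLParser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every anchor element (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


def hrefs_with_params(html):
    """Return the hrefs found in html that contain a semicolon."""
    collector = HrefCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if ';' in h]


# Building the request does not strip the ';_ylt=...' part of the URL.
url = ('http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i'
       ';_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL')
request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
sent_url = request.get_full_url()

# Made-up HTML standing in for a fetched page.
sample = '<a href="/a-i;_ylt=abc">with</a> <a href="/b-i">without</a>'
found = hrefs_with_params(sample)
```

Feeding the real page contents to hrefs_with_params, once with and once
without the browser-like headers, should show whether the server is the
one dropping the parameters.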

-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org


