Urllib2: Only a partial page retrieved

Dragon Lord dragonlordnld at gmail.com
Sat May 22 12:24:32 EDT 2010


Oops, the "good" page is also handled wrongly. Papers from 2000 are
handled wrongly too, so here is a real example of a page that works:

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5206867

On May 22, 11:43 am, Dragon Lord <dragonlord... at gmail.com> wrote:
> I am trying to download a few IEEE pages using urllib2, but for
> certain pages I get only the first part of the page. For other pages
> from the same server and URL (just a different page ID) I get the
> right results. The difference between these pages seems to be the
> publication date of the paper the page describes. Pages for papers
> from before 2000 end just before the date; pages from 2000 and later
> end at </html>.
>
> Two example URLs:
>
> Does not work: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048
> Does work: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=854728
>
> I tried both urlopen and urlretrieve and tried both urllib and
> urllib2. With urlopen I tried both .read() and .read(10000) to make
> sure I got the whole page, but nothing helped.
> Sample code:
>
> import urllib2
> response = urllib2.urlopen("http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048")
> html = response.read()
> print html
>
> The cutoff is always at the same location: just after the label
> "Meeting Date" and before the date itself. Could it be that something
> is interpreted as an EOF marker or something like that?
>
> example of the cutoff point with a bad page:
> <br/><b>Meeting Date: </b>
>
> example of the cutoff point with a good page:
> <br/><b>Meeting Date: </b>
>
>                                                                                 13 jun 2000
>
> The bad pages do continue after this point if you use a web browser,
> btw, so it does not seem to be a server problem.
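
To make the "partial page" check repeatable, here is a minimal sketch that
reads a response in chunks until read() returns nothing, then reports the
byte count and whether a closing </html> tag shows up near the end. The
helper names (read_all, looks_complete, report) and the 200-byte tail
window are my own assumptions for illustration; this only detects the
truncation, it does not explain or fix it.

```python
try:
    from urllib2 import urlopen          # Python 2, as in the sample above
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalent


def read_all(response, chunk_size=8192):
    """Read a file-like response until read() returns an empty string."""
    chunks = []
    while True:
        chunk = response.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def looks_complete(html):
    """Heuristic (assumption): a full page has </html> in its last 200 bytes."""
    return b"</html>" in html[-200:].lower()


def report(url):
    """Fetch url and return (number of bytes received, looks_complete flag)."""
    response = urlopen(url)
    try:
        html = read_all(response)
    finally:
        response.close()
    return len(html), looks_complete(html)
```

Calling report() on the "does not work" and "does work" URLs above and
comparing the two results should show whether the short read happens every
time and at the same byte offset.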



