[Tutor] Strategy to read a redirecting html page

ian douglas ian.douglas at iandouglas.com
Wed Jun 1 01:39:52 CEST 2011



On 05/31/2011 04:34 PM, Hugo Arts wrote:
> On Wed, Jun 1, 2011 at 1:00 AM, Karim <karim.liateni at free.fr> wrote:
>> Hello,
>>
>> I am having an issue reading an HTML page that redirects to a new page.
>> I get the first warning/error message page and not the one it redirects to.
>> Should I request the same URL a second time, or should I loop forever
>> until the page content is the correct one (checked by parsing it)?
>> Do you have a better strategy, or are there modules that deal with this issue?
>> I am using Python 2.7.1 on Ubuntu Linux 11.04 and the modules urllib2,
>> urllib, etc.
>> The web page is secured, but I registered a password manager.
>>
> urllib2 works at the HTTP level, so it can't catch redirects that
> happen at the HTML level, unfortunately. You'll have to parse the page,
> look for a <meta http-equiv="refresh" tag, and fetch the URL from it.
> That's a pretty simple parsing job, probably doable with regexes. But
> you're free to use a proper HTML parser, of course.
>

Also, given that the page you fetch from that tag could ALSO redirect
(urllib2 follows HTTP 301/302 redirects on its own, but not another meta
refresh), I'd suggest looping until a counter is exhausted, so you don't
end up in an infinite loop if pages redirect to each other.
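
Something along these lines might do it. This is only a rough sketch, not
tested against your site: follow_meta_refresh, META_REFRESH, the fetch
argument and the max_hops default are names and values I made up for
illustration, and the regex assumes http-equiv comes before content inside
the tag, so a proper HTML parser would be more robust.

```python
import re

try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2.7

# Assumed tag shape: <meta http-equiv="refresh" content="0; url=/next">
# (http-equiv before content; quotes optional; case-insensitive).
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*'
    r'content\s*=\s*["\']?\s*\d+\s*;\s*url\s*=\s*["\']?([^"\'>\s]+)',
    re.IGNORECASE)


def follow_meta_refresh(url, fetch, max_hops=5):
    """Return (final_url, html), following at most max_hops meta refreshes.

    fetch is any callable that takes a URL and returns the page as text
    (hypothetical hook, so the loop does not care how pages are fetched).
    """
    for _ in range(max_hops + 1):
        html = fetch(url)
        match = META_REFRESH.search(html)
        if match is None:
            # No refresh tag: this is the page we were after.
            return url, html
        # The target may be relative, so resolve it against the current URL.
        url = urljoin(url, match.group(1))
    # Counter exhausted: pages are probably redirecting to each other.
    raise RuntimeError("gave up after %d meta refresh hops" % max_hops)
```

On 2.7 the fetch argument could be something like
lambda u: urllib2.urlopen(u).read(), which should go through your password
manager as long as you installed the opener with urllib2.install_opener().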

-id


