parse html rendered by js

john osborne6 at gmail.com
Sat Feb 12 08:57:37 EST 2011


I've never tried it myself, but you may want to look into running the page through a standalone JavaScript engine, such as SpiderMonkey or Rhino, and then parsing the HTML that the script writes out.
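
If a full JavaScript engine turns out to be more than you need, here is a rough standard-library sketch of a cheaper shortcut. To be clear, it does not execute any JavaScript, so it is not the engine approach described above. It assumes the markup is built from single-quoted string literals joined with +, as in your example, and it swaps every JavaScript expression between literals for a placeholder. The names src_values and SrcCollector and the "{js}" placeholder are made up for illustration.

```python
import re
from html.parser import HTMLParser

# One single-quoted JavaScript string literal (escaped quotes allowed).
# Assumption: the script uses only single-quoted literals, as in the example.
JS_STRING = re.compile(r"'((?:[^'\\]|\\.)*)'", re.DOTALL)


class SrcCollector(HTMLParser):
    """Collect (tag, src) pairs from every start tag that has a src attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "src":
                self.found.append((tag, value))


def src_values(js_source, placeholder="{js}"):
    """Approximate the output of document.write() and return its src values."""
    # Grab everything between "document.write(" and the last ")" in the source.
    match = re.search(r"document\.write\((.*)\)", js_source, re.DOTALL)
    if match is None:
        return []
    argument = match.group(1)

    pieces = []
    pos = 0
    for literal in JS_STRING.finditer(argument):
        # Text between two literals is either a bare "+" or a JS expression.
        between = argument[pos:literal.start()].strip(" \t\r\n+")
        if between:
            pieces.append(placeholder)  # stand-in for the unknown JS value
        pieces.append(literal.group(1))
        pos = literal.end()
    if argument[pos:].strip(" \t\r\n+"):
        pieces.append(placeholder)

    parser = SrcCollector()
    parser.feed("".join(pieces))
    parser.close()
    return parser.found
```

Calling src_values() on the text of the script should give a list of (tag, src) pairs in which the parts that depend on rtfPossibleString or main.clientargs show up as "{js}". If you need the real values of those variables, you are back to running the script in an actual engine.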

On Friday, February 11, 2011 2:20:32 AM UTC-6, yanghq wrote:
> Hi,
>     I want to extract attribute values such as href and src from HTML.
> 
>     For simple HTML pages, libxml2dom can parse the page into a DOM and
> give me what I want.
> 
>     But some pages are rendered by JavaScript, like this:
> 
> document.write(
> '<frameset border="0" frameborder="no" rows="0,*,0" onLoad="start()"
> onUnload="end()" onResize="change()">'+
>   '<frameset border="0" frameborder="no" cols="*,*,*,*,*,0">'+
> '<frame name="cfgFrame" noresize scrolling="no"
> src="../frame.html?rtfPossible=' + rtfPossibleString + '">'+
> '<frame name="mboxFrame" noresize scrolling="no"
> src="../frame.html?rtfPossible=' + rtfPossibleString + '">'+
> '<frame name="cmdFrame" noresize scrolling="no"
> src="../frame.html?rtfPossible=' + rtfPossibleString + '">'+
> '<frame name="msgFrame" noresize scrolling="no"
> src="../frame.html?rtfPossible=' + rtfPossibleString + '">'+
> '<frame name="pabFrame" noresize scrolling="no"
> src="../frame.html?rtfPossible=' + rtfPossibleString + '">'+
> '<frame name="cnFrame" noresize scrolling="no" src="../frame.html?' +
> main.clientargs + '">'+
>   ''+
>   '<frame name="mailFrame" marginwidth="0" marginheight="0" noresize
> src="../frame.html?rtfPossible=' + rtfPossibleString + '">'+
> '<frame name="appletFrame" marginwidth="0" marginheight="0" noresize
> src="../frame.html?rtfPossible=' + rtfPossibleString + '">'+
> ''
> )
> How can I get the value of the 'src' attribute? Thank you for any help.
> 




