parse html rendered by js

Alex Willmer alex at moreati.org.uk
Fri Feb 11 09:46:01 EST 2011


On Feb 11, 8:20 am, yanghq <yan... at neusoft.com> wrote:
> hi,
>     I want to get attribute values like href, src... in html.
>
>     for a simple html page, libxml2dom can help me parse it into a dom
> and get what I want;
>
>     but for some pages rendered by js, like:
>
> document.write(
> '<frameset border="0" frameborder="no" rows="0,*,0" onLoad="start()"
> onUnload="end()" onResize="change()">'+
>   '<frameset border="0" frameborder="no" cols="*,*,*,*,*,0">'+
> '<frame name="cfgFrame" noresize scrolling="no"
> src="../frame.html?rtfPossible=' + rtfPossibleString + '">'+
> ...
> )
> how can I get the attribute value of 'src'? Thank you for any help.

You can either:
- Duplicate the Javascript behaviour in your Python code (i.e. rewrite
it yourself). Whenever the Javascript changes you will need to update
your duplicate code.
- Use or write a Python module that drives a web browser to download
and execute the page. I'm not aware of any that exist.
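
As a rough sketch of the first option: if the script does nothing more
than concatenate single-quoted string literals and plain variables
inside one document.write() call, you can rebuild the HTML in Python
and parse that. Everything named here (JS_SOURCE, SrcCollector,
extract_srcs, the sample script text) is made up for illustration, and
the value of rtfPossibleString is something you would have to supply
yourself. It will not cope with escaped quotes, function calls or any
real Javascript logic.

```python
import re
from html.parser import HTMLParser

# Hypothetical script text, modelled on the snippet quoted above.
JS_SOURCE = """
document.write(
'<frameset border="0" frameborder="no" cols="*,*">'+
'<frame name="cfgFrame" noresize scrolling="no" src="../frame.html?rtfPossible=' + rtfPossibleString + '">'+
'</frameset>'
)
"""

# A single-quoted literal (assumes no escaped quotes) or a bare identifier.
TOKEN = re.compile(r"'([^']*)'|([A-Za-z_$][\w$]*)")


class SrcCollector(HTMLParser):
    """Collect the value of every src attribute seen."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "src":
                self.srcs.append(value)


def extract_srcs(js_source, variables):
    """Rebuild the HTML document.write() would emit; return its src values."""
    call = re.search(r"document\.write\((.*)\)", js_source, re.S)
    if call is None:
        return []
    pieces = []
    for token in TOKEN.finditer(call.group(1)):
        if token.group(1) is not None:
            pieces.append(token.group(1))  # string literal: keep its text
        else:
            # variable: substitute the supplied value (empty if unknown)
            pieces.append(variables.get(token.group(2), ""))
    parser = SrcCollector()
    parser.feed("".join(pieces))
    parser.close()
    return parser.srcs


print(extract_srcs(JS_SOURCE, {"rtfPossibleString": "true"}))
# prints ['../frame.html?rtfPossible=true']
```

This is exactly the kind of duplicate code that breaks whenever the
page's Javascript changes, so treat it as a stopgap.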

Neither option is very good, and that is one reason why such
Javascript is considered bad practice. What is your end goal (e.g.
testing a web application, screen scraping a web site)? There may be a
better way.

Alex



More information about the Python-list mailing list