read web page that requires javascript on client

lkcl luke.leighton at googlemail.com
Sun Mar 29 08:07:00 EDT 2009


On Mar 18, 8:01 pm, Greg <gregsaundersem... at gmail.com> wrote:
> Hello all, I've been trying to find a way to fetch and read a web page
> that requires javascript on the client side and it seems impossible.

 you're right: it's not impossible.

> I've read several threads in this group that say as much but I just
> can't believe it to be true

 you're right: it's not true.

> (I'm subscribing to the "argument of
> personal incredulity " here).

 there are several approaches that you can take that combine python
and javascript: none of them are at the level of "simplicity" which
you and many others may be expecting, which is why it's believed to be
"impossible" or "not achievable".

they all have different advantages and disadvantages - don't be
surprised if you end up with 30 MB of binaries on your system, _just_
to support the features you're implicitly asking for, ok?

 here are the approaches i've found so far:

 1) python-spidermonkey

 python-spidermonkey "rips out" the mozilla javascript engine and
provides you with a hybrid mechanism where the execution context can
be shared between the two languages.

 in other words, variables and functions can be shoved into the
namespace of the spidermonkey javascript context and executed; python
can likewise (in a rather clunky way at the moment) gain access to the
execution context and "call in".

 what this approach does NOT have is the "DOM model" functions.  those
have been REMOVED as they are ONLY part of the W3C specification for
implementation of web browsers, NOT the ECMAScript specification.

 2) PyV8 - http://code.google.com/p/pyv8

 take 1) above, and sed -e "s/python-spidermonkey/pyv8/g"

 flier liu, the author of pyv8, is actually _doing_ what you want to
do.  namely, he's started with a combination of python plus google's
V8 javascript engine, and he's now moving on to implementing the DOM
as *python*, for execution as a python console-only application.

 he recognises the need for execution of javascript, as part of the
requirement, and that's the reason why he has added google v8.

 by doing this "hybrid", he will be able to "add" a global variable
called "document" to the javascript context, and another global
variable called "window" to the javascript context, etc. etc. and then
"execution" of the javascript will result in callbacks - into python -
to emulate, in its entirety, the complete W3C DOM model standard.

 far be it from me to tell him how monstrously large the task of
reimplementing the W3C DOM standard in python is; instead, i urge you
to consider helping him out with his project.
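
 to make the "document" global described above a bit more concrete,
here is a minimal sketch of the python side only.  the class and
method names (Element, Document, to_html) are hypothetical and are NOT
taken from pyv8; wiring the object into a javascript context is left
out, so the "callbacks" are simply called from python here.  it does
no escaping and covers a tiny fraction of the W3C DOM.

```python
# hypothetical python-side stand-ins for the "document" global.
# a real DOM implementation has vastly more surface than this.

class Element(object):
    def __init__(self, tag):
        self.tagName = tag
        self.attributes = {}
        self.children = []
        self.text = ""

    def setAttribute(self, name, value):
        self.attributes[name] = value

    def getAttribute(self, name):
        return self.attributes.get(name)

    def appendChild(self, child):
        self.children.append(child)
        return child

    def to_html(self):
        # flat serialisation; no escaping, illustration only
        attrs = "".join(' %s="%s"' % (k, v)
                        for k, v in sorted(self.attributes.items()))
        inner = self.text + "".join(c.to_html() for c in self.children)
        return "<%s%s>%s</%s>" % (self.tagName, attrs, inner, self.tagName)


class Document(object):
    def __init__(self):
        self.body = Element("body")

    def createElement(self, tag):
        return Element(tag)

    def getElementById(self, wanted):
        # depth-first search from <body>
        stack = [self.body]
        while stack:
            node = stack.pop()
            if node.getAttribute("id") == wanted:
                return node
            stack.extend(node.children)
        return None


# what a script's calls would amount to once routed into python:
document = Document()
div = document.createElement("div")
div.setAttribute("id", "greeting")
div.text = "hello"
document.body.appendChild(div)
print(document.body.to_html())
```

 every DOM call a page's javascript makes would need a python
counterpart like these, which is why the full task is so large.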

 3) pywebkitgtk (+patch #13) + webkit-glib/gdom (+patch #16401)

 this one's a whopping-great project that takes the ENTIRE webkit
engine, patched to include glib / gobject bindings, so that python can
"get at" the DOM model, directly.

 you can use this to "execute" a web page - bear in mind that GTK apps
do NOT have to be "visible" - you CAN "run" a GTK app WITHOUT actually
putting up an on-screen GUI widget.

 in this way, you will be able to "load" a web page, have it be
"executed", and then, after a specific and arbitrary amount of time,
run some python using the python-bindings to the DOM model to either
"walk" the DOM model or just call the "toString()" method and obtain a
flat HTML representation of the entire page.

CAVEATS: apple's employees are flexing their muscles and are
unfortunately showing that they have power and control by limiting the
functionality of the glib / gobject bindings to "that which they deem
to be acceptable".  apple's employees have deemed that strict
compliance to the W3C standard is how they want things to be, and are
ignoring the fact that the de-facto standard is actually that
specified by Javascript implementations.

in other words, toString, being a de-facto standard, is "unacceptable"
to them, as are a couple of other things.

 4) python-hulahop

 exactly the same as 3) except using mozilla not webkit: hulahop is
the ENTIRE gecko engine, with python bindings via the XUL interface.
the hulahop team are the ONLY people who have been able to understand
the obtuse XUL interface enough to be able to make python bindings
actually _work_ :)

it's clear that the OLPC / SUGAR team looked at webkit, initially, and
loved it.  however, they saw the lack of glib/gobject bindings, and
the lack of python bindings, and freaked out (whereas i, rather
stupidly, went "nooo problem saah!" and _added_ glib / gobject
bindings to webkit)

so they then went "ahhhh, safety", abandoned webkit and made a beeline
for XUL.

so they have complete and total control over the DOM model, from
python, including (thanks to gecko's ability to execute javascript
using spidermonkey) the ability to interact two-way with javascript
(exactly as can be done with webkit's glib/gobject + pywebkitgtk
bindings).

so - _again_ - you have the choice of being able to run a GTK app -
without an actual "window" - load up a web page and then tell the
XUL / Gecko engine "GO!  EXECUTE JAVASCRIPT!", and then, at some point
in the future, walk the DOM model using the python XUL bindings or
call the document.toString() method, from python, and obtain the
resultant HTML.
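
 both 3) and 4) end with the same step: walk the DOM, or flatten it
back to markup.  the sketch below shows that step only, using the
standard library's xml.dom.minidom as a stand-in DOM.  that is an
assumption made for illustration: the webkit and hulahop bindings
expose their own objects, and minidom's toxml() is its own method,
not the toString() mentioned above.  the page string is made up and
stands for a document after its scripts have already run.

```python
from xml.dom import minidom


def walk(node, depth=0):
    """yield (depth, node) for node and all descendants, in document order."""
    yield depth, node
    for child in node.childNodes:
        for item in walk(child, depth + 1):
            yield item


# made-up example: the DOM as it might look after scripts have run
page = minidom.parseString(
    '<html><body><div id="out">filled in by script</div></body></html>')

# option 1: walk the tree and pick out what you need
tags = [n.tagName for _, n in walk(page.documentElement)
        if n.nodeType == n.ELEMENT_NODE]
print(tags)

# option 2: flatten the whole tree back to markup
flat = page.documentElement.toxml()
print(flat)
```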


so - the answer to your question is: yes, it's technically possible.
and yes, it's even been done (twice).  successfully.  in two separate
and distinct ways, with at least a third in active development that i
know of, and a fourth as a possible candidate to build on.

but i have to warn you - these are _not_ small projects: relying on
and leveraging the expertise behind e.g. Webkit means that you're
backed by MAN CENTURIES of effort (see the statistics e.g. on
http://www.ohloh.net/p/WebKit : an estimated 480 man-years of time
spent so far - if you look at mozilla you'll find it's a similar
amount)

l.



More information about the Python-list mailing list