Python- javascript

lkcl luke.leighton at googlemail.com
Sun Sep 6 11:09:14 EDT 2009


On Aug 16, 12:02 am, Mike Paul <paul.mik... at gmail.com> wrote:
> I'm trying to scrape a dynamic page with a lot of javascript in it.
> In order to get all the data from the page i need to access the javascript. But i've no idea how to do it.
>
> Say I'm scraping some site http://www.xyz.com/xyz
>
> import urllib2
> request=urllib2.Request("http://www.xyz.com/xyz")
> response=urllib2.urlopen(request)
> data=response.read()
>
> So i get all the data on the initial page. Now i need to access the javascript on this page to get additional details. I've heard someone
> telling me to use spidermonkey. But no idea on how to send javascript
> as request and get the response. How should i be sending the javascript
> request? how can it be sent?

 you need to actually _execute_ the web page under a browser engine.

 you will not be able to do what you want, using urllib.
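
 to illustrate why: a plain HTTP fetch only hands you the page source, with the javascript still sitting there as unexecuted text. the sketch below is an assumption-laden illustration, not anything from the original thread: it uses the python 3 standard library (html.parser rather than the python 2 urllib2 above), and the PAGE string and the ScriptCollector name are made up for the example.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Separate visible text from raw <script> source, without running any script."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []   # source text of each <script> element
        self.text = []      # text found outside <script> elements

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            # script bodies may arrive in several chunks, so accumulate them
            self.scripts[-1] += data
        elif data.strip():
            self.text.append(data.strip())


# hypothetical page: the real value is only written into the DOM by the script
PAGE = """<html><body>
<div id="price">loading...</div>
<script>document.getElementById("price").innerHTML = "42";</script>
</body></html>"""

collector = ScriptCollector()
collector.feed(PAGE)
collector.close()

print(collector.text)     # what a static scraper sees as page text
print(collector.scripts)  # the javascript, present but never executed
```

 the static text still says "loading..."; the value only exists once something actually runs the script against a DOM, which is what a browser engine gives you.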


> Can anyone tell me how can i do it very clearly. I've been breaking my
> head into this for the past few days with no progress.

 there are about four or five engines that you can use, depending on
the target platform.

see: http://wiki.python.org/moin/WebBrowserProgramming


1) python-khtml (pykhtml).
2) pywebkitgtk (with DOM / glib-gobject bindings patches)
3) python-hulahop and xulrunner
4) Trident (the MSHTML engine behind IE) accessed through python
comtypes
5) macosx objective-c bindings and use pyobjc from there.

options 2-4 i have successfully used and proven that it can be done:

http://pyjamas.svn.sourceforge.net/viewvc/pyjamas/trunk/pyjd/

option 1) i haven't done due to an obscure bug in the KDE KHTML
python-c++ bindings; option 5) i haven't done because there's no point:
XMLHttpRequest has been deliberately excluded due to short-sightedness
of the webkit developers, which has only recently been corrected (but
the work still needs to be done).

so, using a web browser engine, you must load and execute the page,
and then you can use DOM manipulation to extract the web page text,
after a certain amount of time has elapsed, and the javascript has
completed execution.
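
rather than sleeping for a fixed time, one option is to poll the DOM until the text you want shows up, and give up after a timeout. the following is only a minimal sketch of that idea in python 3: wait_until and FakePage are hypothetical names invented here, and FakePage merely stands in for whatever DOM accessor your chosen engine provides.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.1):
    """Call predicate() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("page did not settle within %.1f seconds" % timeout)
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for an engine's DOM: text appears after a few polls."""

    def __init__(self, polls_needed):
        self.polls_left = polls_needed

    def get_text(self):
        # pretend the page's javascript fills in the text after some delay
        self.polls_left -= 1
        return "price: 42" if self.polls_left <= 0 else ""


page = FakePage(polls_needed=3)
text = wait_until(page.get_text, timeout=2.0, interval=0.01)
print(text)
```

note that inside a real GUI toolkit main loop a blocking time.sleep() would most likely stall the engine itself, so the same check would presumably belong in a timer callback instead; the loop above only shows the logic.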

if you _really_ want to create your own javascript execution engine,
which, my god it will be a hell of a lot of work but would be
extremely beneficial, you would do well to help flier liu with pyv8,
and paul bonser with pybrowser.

flier is doing a web-site-scraping system, browsing millions of pages
and executing the javascript under pyv8.  paul is implementing a web
browser in pure python (using python-cairo as the graphics engine).
he's got part-way through the project, having focussed initially on a
W3C standards-compliant implementation of the DOM, and less on the
graphics side.  that means that what paul has will be somewhat more
suited to what you need, because you don't want a graphics engine at
all.

if paul's work isn't suitable, then you will simply have to run one
of the above engines _without_ firing up the actual GUI window.  in the
GTK-based engines, you just... don't call show() or show_all(); in the
MSHTML-based one, i presume you just don't fire a WM_SHOW event at it.

you'll work it out.

l.


