Utility to screenscrape sites using javascript ?

Nobody nobody at nowhere.com
Sun Jan 31 15:57:44 EST 2010


On Sat, 30 Jan 2010 11:28:47 -0800, KB wrote:

>> > I have a service I subscribe to that uses javascript to stream news.
> 
>> There's a Python interface to SpiderMonkey (Mozilla's JavaScript
>> interpreter):
>>
>> http://pypi.python.org/pypi/python-spidermonkey
> 
> Thanks! I don't see a documentation page, but how would one use this?
> Would you download the HTML using urllib2/mechanize, then parse for the
> .js script and use spider-monkey to "execute" the script and the output is
> passed back to python?

Something like that.

The "homepage" link:

	http://github.com/davisp/python-spidermonkey

has some examples further down. The first one starts with:

	>>> import spidermonkey
	>>> rt = spidermonkey.Runtime()
	>>> cx = rt.new_context()
	>>> cx.execute("var x = 3; x *= 4; x;")
	12

It goes on to mention the .add_global(name, object) method, using it to
name a Python object such that its variables and methods can be accessed
from JS.
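
A rough sketch of how that could get results back into Python, assuming .add_global() and .execute() behave as the README describes. The Feed class, the "feed" global and collect_headlines() are names made up for illustration, not part of python-spidermonkey:

```python
class Feed(object):
    """Hypothetical Python-side sink that the script pushes results into."""

    def __init__(self):
        self.items = []

    def push(self, headline):
        # Called from JS as feed.push("..."); the value ends up in a Python list.
        self.items.append(headline)


def collect_headlines(cx, script):
    """Run `script` in context `cx` with a Feed exposed to JS as the global "feed".

    `cx` is assumed to be a context from spidermonkey.Runtime().new_context();
    only its add_global() and execute() methods are used here.
    """
    feed = Feed()
    cx.add_global("feed", feed)
    cx.execute(script)
    return feed.items
```

With the module installed, it would be driven along these lines:

	>>> import spidermonkey
	>>> cx = spidermonkey.Runtime().new_context()
	>>> collect_headlines(cx, 'feed.push("first headline");')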

For scraping web pages, I suspect that you'll at least need to create an
object and name it "document", so that "document.location" etc works. How
much you'll need to implement will depend upon what the web page uses,
although you can probably use .__getattr__() to serve up dummy handlers
for calls which can be ignored.
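
Something along these lines, as a minimal sketch in plain Python. StubDocument, Location and the example URL are invented for illustration, and I'm assuming the wrapper looks attributes up on the Python object in the ordinary way, so that .__getattr__() gets a chance to answer:

```python
class Location(object):
    """Hypothetical stand-in for the browser's document.location."""

    def __init__(self, href):
        self.href = href


class StubDocument(object):
    """Hypothetical minimal "document" for scripts run outside a browser."""

    def __init__(self, url):
        self.location = Location(url)
        self.written = []  # collects whatever the script passes to document.write()

    def write(self, text):
        self.written.append(text)

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. for names not defined above.
        if name.startswith("_"):
            # Don't pretend to support private or special names.
            raise AttributeError(name)

        def ignored(*args, **kwargs):
            # Dummy handler: accept any arguments, do nothing.
            return None

        return ignored


doc = StubDocument("http://example.com/news")
doc.write("<p>headline</p>")        # implemented, so the text is recorded
doc.addEventListener("load", None)  # not implemented, silently ignored
```

The object would then be handed to the context with cx.add_global("document", doc). Note that this answers every unknown name with a function, so a script which reads a property rather than calling a method will get a function back instead of a value; anything the page actually relies on still needs a real implementation.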

Further documentation is probably a case of "read the spidermonkey
documentation, then read the python-spidermonkey source if it isn't clear
how the wrapper relates to spidermonkey itself".



