parsing javascript

Philip Semanchuk philip at semanchuk.com
Sun Oct 12 10:28:17 EDT 2008


On Oct 12, 2008, at 5:25 AM, S.Selvam Siva wrote:

> I have to parse web pages and fetch URLs. My problem is that many of
> the URLs I need are loaded dynamically by a JavaScript function
> (onload()). How can I fetch those links from Python? Thanks in advance.

Selvam,
You can try to find them yourself using string parsing, but that's  
difficult. The closer you want to get to "perfect" at finding URLs  
expressed in JS, the closer you'll get to rewriting a JS interpreter.  
For instance, this is not so hard to understand:
    "http://example.com/"
but this is:
    "http://ZZZ_DOMAIN_ZZZ/index.html".replace(/ZZZ_DOMAIN_ZZZ/, the_domain_variable)
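
To make the difference concrete, here is a minimal sketch of the string-parsing approach, assuming a simple regular expression is good enough for quoted URL literals. The pattern and the name find_literal_urls are made up for illustration, not taken from any library. It finds both strings, but for the second one it can only report the placeholder text, because the real host exists only once the JavaScript runs.

```python
import re

# Hypothetical pattern (an assumption, not a complete URL grammar):
# an http(s) URL that appears verbatim inside a quoted JS string literal.
URL_LITERAL = re.compile(r"""["'](https?://[^"'\s]+)["']""")


def find_literal_urls(js_source):
    """Return the URLs that appear as plain string literals in js_source."""
    return URL_LITERAL.findall(js_source)


js = '''
var a = "http://example.com/";
var b = "http://ZZZ_DOMAIN_ZZZ/index.html".replace(/ZZZ_DOMAIN_ZZZ/, the_domain_variable);
'''

urls = find_literal_urls(js)
for url in urls:
    print(url)
# The second URL printed still contains ZZZ_DOMAIN_ZZZ: the regex sees the
# literal text only and knows nothing about the value of the_domain_variable.
```

Handling .replace() calls, string concatenation, variables and so on is where this starts turning into an interpreter.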

This is a long-standing problem for any program that parses Web pages.  
You either have to embed a JS interpreter in your application or just  
ignore the JavaScript. Most Web parsing robots take the latter route.
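
For the "ignore the JavaScript" route, a rough sketch using only the standard library (Python 3 module names) might look like the following. The LinkCollector class, the sample page and the base URL are all invented for the example. It collects href values from static <a> tags and never looks inside <script>, so the link that the onload handler would add is simply not found.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from static <a> tags; script contents are not interpreted."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


# Made-up page: two static links plus one that only exists after onload runs.
page = """<html><body onload="addLink()">
<a href="/about.html">About</a>
<a href="http://example.com/news">News</a>
<script>
function addLink() {
    var a = document.createElement("a");
    a.href = "/dynamic.html";
    document.body.appendChild(a);
}
</script>
</body></html>"""

collector = LinkCollector("http://example.com/")
collector.feed(page)
collector.close()
for link in collector.links:
    print(link)
```

Only the two static links come out; /dynamic.html is missed, which is the price of skipping the JavaScript.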

Good luck
Philip


