question about urllib and parsing a page

David Wahler dwahler at gmail.com
Wed Nov 2 14:43:33 EST 2005


nephish at xit.net wrote:
> hey there,
> i am using beautiful soup to parse a few pages (screen scraping)
> easy stuff.
> the issue i am having is with one particular web page that uses a
> javascript to display some numbers in tables.
>
> now if i open the file in mozilla and "save as" i get the numbers in
> the source. cool. but if i click on "view source" or download the url
> with urlretrieve, i get the source, but not the numbers.
>
> is there a way around this ?
>
> thanks

If the JavaScript is automatically generated by the server with the
numbers in a known location, you can use a regular expression to
extract them. For example, if there's something in the code like:

    var numbersToDisplay = [123,456,789];

Then you could use (warning: this is not fully tested):

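    import re
    js_source = "... the source inside the <script> tag ..."
    # Capture everything between the square brackets of the array literal.
    match = re.search(r'numbersToDisplay = \[([^]]*)\];', js_source)
    # match is None if the pattern isn't found, so check before using it.
    numbers_str = match.group(1)
    numbers_list = numbers_str.split(",")  # a list of strings, not ints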

You'll obviously have to vary this to match your particular script.
Bear in mind that this won't work if the values are computed by the
JavaScript itself rather than on the server. If that's the case, then
unless you feel like implementing a complete IE- and Mozilla-compatible
browser DOM and JavaScript interpreter, you're out of luck.
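
For the server-generated case, here is a slightly fuller sketch of the
regular-expression approach. It is only an illustration: the function name
extract_numbers, the variable name numbersToDisplay, and the sample page
fragment are all made up, and it assumes the array holds plain integers.

```python
import re

def extract_numbers(js_source, var_name="numbersToDisplay"):
    """Return the integers in `<var_name> = [...]`, or [] if not found."""
    # re.escape guards against regex metacharacters in the variable name.
    pattern = r"\b" + re.escape(var_name) + r"\s*=\s*\[([^\]]*)\]"
    match = re.search(pattern, js_source)
    if match is None:
        return []
    # Split on commas, skip empty pieces, and convert each piece to int.
    return [int(piece) for piece in match.group(1).split(",") if piece.strip()]

# Hypothetical page fragment, standing in for the downloaded HTML.
sample = '<script>var numbersToDisplay = [123, 456,789];</script>'
print(extract_numbers(sample))
```

In current Python the page text could come from something like
urllib.request.urlopen(url).read().decode(), assuming you know the page's
encoding. Returning an empty list when nothing matches is just one choice;
raising an exception may suit a scraper better.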

-- David



