help!! *extra* tricky web page to extract data from...

Paul Rubin http
Tue Mar 13 19:14:10 EDT 2007


"Diez B. Roggisch" <deets at nospam.web.de> writes:
> Obviously this wouldn't really help, as you can't predict what a
> website actually wants which events, in possibly which
> order. Especially if the site does not _want_ to be scrapable- think
> of a simple "click on the images in the order of the numbers shown on
> them" captcha.

Sure, but most sites don't go to such lengths, and even captchas can
be defeated if you're trying to scrape a specific site and are willing
to spend effort on the particular captcha generator that it uses.
Plus there is always www.captchasolver.com (!).

> Most time it's easier to sniff the http stream & grab the data directly.

Certainly true, but there are times when you have to pull stuff out of
the JS.  It's usually possible to do that without actually
interpreting the JS, but an interpreter would make it a lot more
convenient some of the time.
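
As a rough sketch of pulling data out of the JS without interpreting it:
the example below assumes the page embeds its data as a strict JSON
literal assigned to a variable inside a script tag.  The page fragment,
the variable name chartData and the helper extract_js_object are all
made up for illustration; a real site may format things differently
(single quotes, unquoted keys, computed values), and that is where an
actual interpreter would start to pay off.

```python
import json
import re

# Hypothetical page fragment; `chartData` and `drawChart` are invented names.
HTML = """
<html><body>
<script type="text/javascript">
var chartData = {"prices": [10.5, 11.25, 9.0], "symbol": "XYZ"};
drawChart(chartData);
</script>
</body></html>
"""


def extract_js_object(html, var_name):
    """Return the JSON-compatible literal assigned to `var_name`, or None.

    Assumes the assignment looks like `var NAME = <strict JSON>`.
    """
    # Locate the assignment and consume the whitespace after the `=`.
    match = re.search(r"\bvar\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (here, the trailing `;` and more JS).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


if __name__ == "__main__":
    print(extract_js_object(HTML, "chartData"))
```

If the literal is not strict JSON, raw_decode raises a ValueError, so
this only covers the easy cases.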


