Javascript website scraping using WebKit and Selenium tools

dieter dieter at handshake.de
Thu Jul 2 01:48:20 EDT 2015


Veek M <vek.m1234 at gmail.com> writes:

> I tried scraping a javascript website using two tools, both didn't work. The 
> website link is: http://xdguo.taobao.com/category-499399872.htm The relevant 
> text I'm trying to extract is 'GY-68...':
>
> <div class="item3line1">
>
>     <dl class="item " data-id="38952795780">
>         <dt class="photo">
>             <a target="_blank" href="//item.taobao.com/item.htm?spm=a1z10.5-
> c.w4002-6778075404.11.54MDOI&id=38952795780" data-spm-wangpu-module-
> id="4002-6778075404" data-spm-anchor-id="a1z10.5-c.w4002-6778075404.11">
>                 <img 
> src="//img.alicdn.com/bao/uploaded/i4/TB1HMt3FFXXXXaFaVXXXXXXXXXX_!!0-
> item_pic.jpg_240x240.jpg" alt="GY-68 BMP180 ?? BOSCH?? ??????? ??
> BMP085"></img>
>             </a>
>         </dt>

> ...

When I try to access the link above, I am redirected to a
login page, which of course may look very different from what you expect.
You may need to pass authentication information along with
your request in order to get the page you are expecting.
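
A minimal sketch of how one might notice such a redirect, using only the
standard library. The helper names ("looks_like_login_redirect", "fetch")
and the "cookie_header" parameter are made up for illustration, and the
check itself is only a heuristic: I am assuming that a login redirect
either changes the host name or lands on a path containing "login".

```python
from urllib.parse import urlsplit
import urllib.request


def looks_like_login_redirect(requested_url, final_url):
    """Heuristic (an assumption, not a rule): treat the response as a
    login redirect if the host changed or the final path mentions 'login'."""
    requested = urlsplit(requested_url)
    final = urlsplit(final_url)
    host_changed = requested.hostname != final.hostname
    return host_changed or "login" in final.path.lower()


def fetch(url, cookie_header=None):
    """Fetch a URL, optionally sending a Cookie header copied from a
    logged-in browser session. Returns (final_url, body_bytes)."""
    request = urllib.request.Request(url)
    if cookie_header:
        request.add_header("Cookie", cookie_header)
    with urllib.request.urlopen(request) as response:
        # geturl() reports the URL after any redirects were followed.
        return response.geturl(), response.read()
```

Calling fetch() on the category URL and passing both the requested and
the returned URL to looks_like_login_redirect() would tell you whether
you actually got the page you asked for.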

Note also that today's sites often make heavy use of JavaScript, which
means that a page only reaches its final form after the JavaScript
has been executed.


Once the problem of getting the "final" HTML code is solved,
I would use "lxml" and its "xpath" support to locate the
relevant information in the HTML.
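
As a rough sketch of the "lxml" step, assuming the final HTML resembles
the fragment quoted above: the sample below is a shortened copy of that
fragment (the "href" and "src" values are abbreviated placeholders and
the garbled characters in the "alt" text are left out), and the function
name "item_titles" is made up for illustration.

```python
from lxml import html

# Shortened stand-in for the rendered page; attribute values are
# abbreviated placeholders, not the real ones.
SAMPLE = """
<div class="item3line1">
  <dl class="item " data-id="38952795780">
    <dt class="photo">
      <a target="_blank" href="//item.taobao.com/item.htm?id=38952795780">
        <img src="//img.alicdn.com/example.jpg" alt="GY-68 BMP180 BMP085">
      </a>
    </dt>
  </dl>
</div>
"""


def item_titles(page_source):
    """Return the 'alt' text of every product image in the page."""
    tree = html.fromstring(page_source)
    # contains() is used because the class attribute is "item " with a
    # trailing space, so an exact match on "item" would fail.
    return tree.xpath(
        '//dl[contains(@class, "item")]/dt[@class="photo"]//img/@alt'
    )


titles = item_titles(SAMPLE)
```

On the real page the same XPath would only work if the markup is
actually present in the HTML you obtained, that is, after login and
after the JavaScript has run.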



