[Tutor] Beautiful Soup

Alan Gauld alan.gauld at btinternet.com
Sun Dec 13 04:10:57 EST 2015


On 13/12/15 07:44, Crusier wrote:
> Dear All,
> 
> I am trying to scrape the following website; however, I have
> encountered some problems. As you can see, I am not really familiar
> with regex and I hope you can give me some pointers on how to solve
> this problem.

I'm not sure why you mention regex because your script doesn't
use regex. And for HTML that's a good thing: regexes are a poor
fit for parsing markup.

> I hope I can download all the transaction data into the database.
> However, I need to retrieve it first. The data I hope to
> retrieve is as follows:
> 
> "
> 15:59:59     A     500     6.790     3,395
> 15:59:53     B     500     6.780     3,390................
> 

Part of your problem is that the data is not in the HTML markup at
all; it is embedded in the JavaScript code on the page. BeautifulSoup
parses HTML, not JavaScript, so it can find the <script> elements
for you but it cannot pick the data out of the code inside them.

The page source looks like this:

<script type="text/javascript"
src="../js/jquery.js?verID=20150826_153700"></script>
<script type="text/javascript"
src="../js/common_eng.js?verID=20150826_153700"></script>
<script type="text/javascript"
src="../js/corsrequest.js?verID=20150826_153700"></script>
<script type="text/javascript"
src="../js/wholedaytran.js?verID=20150826_153700"></script>
<script type="text/javascript">
var json_result =
{"content":{"0":{"code":"6,881","timestamp":"15:59:59","order":"1175","transaction_type":"","bidask":"...{"code":"6,881","timestamp":"15:59:53","order":"1174","transaction_type":"",...{"code":"6,881","timestamp":"15:59:53","order":"1173",...

followed by a bunch of function definitions and other stuff.


> def turnover_detail(url):
>     response = requests.get(url)
>     html = response.content
>     soup = BeautifulSoup(html,"html.parser")
>     data = soup.find_all("script")
>     for json in data:
>         print(json)

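The name json here is misleading because it's not really
JSON data at this point but JavaScript code. (It would also
hide the standard library's json module if you imported it,
and you will probably want that module.) You will need to
further filter the code lines down to the ones containing
data and then convert them into pure JSON.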

You don't show us the output, but I'm assuming it's
the full JavaScript program? If so, you need a second
level of parsing to extract the data from that.

It shouldn't be too difficult since the data you want
all starts with the string {"code": apart from the
first line, which will need a little extra work.
But I don't think you really need any regex to do
this; regular string methods should suffice.

I suggest you write a helper function to do the data
extraction and experiment in the interpreter using some
cut-and-pasted sample data until you get it right!
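
To give you an idea of the shape such a helper might take, here is
a minimal sketch, not a finished solution. It assumes the data is
assigned to a variable called json_result, as in the page code above,
and that the value is valid JSON. The function name
extract_json_result and the sample string are my own inventions: the
sample is a hand-made stand-in that uses only the three fields visible
in the snippet above, and its "1" key and closing braces are guesses,
so check them against the real page. It uses str.find() to locate the
data and the json module's raw_decode(), which reads one JSON value
and ignores whatever code follows it.

```python
import json


def extract_json_result(script_text, marker="var json_result"):
    """Return the object assigned after `marker`, or None if absent.

    `marker` is an assumption taken from the page snippet; adjust it
    if the real page names the variable differently.
    """
    start = script_text.find(marker)
    if start == -1:
        return None
    # The JSON value begins at the first opening brace after the marker.
    brace = script_text.find("{", start)
    if brace == -1:
        return None
    # raw_decode parses one JSON value starting at index `brace` and
    # ignores the JavaScript that comes after it.
    obj, _end = json.JSONDecoder().raw_decode(script_text, brace)
    return obj


# Hand-made stand-in for the contents of the inline <script> block.
sample = '''
var json_result =
{"content":{"0":{"code":"6,881","timestamp":"15:59:59","order":"1175"},
"1":{"code":"6,881","timestamp":"15:59:53","order":"1174"}}};
function show() { return 1; }
'''

result = extract_json_result(sample)
for key, row in result["content"].items():
    print(row["timestamp"], row["order"])
```

In your own script you would pass in the text of each script element
in turn and keep the first result that is not None. If the real data
turns out not to be strict JSON, raw_decode() will raise a ValueError,
which at least tells you where the extra work is needed.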

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



