[Tutor] Beautiful Soup

bruce badouglas at gmail.com
Sun Dec 13 15:48:09 EST 2015


Hey Crusier/ (And Others...)

For your site...

As Alan mentioned, it's a mix of HTML/JavaScript/etc.

So, you're going to need (or perhaps should try) to extract just the
json/struct that you need, and then go from there. I speak from
experience, as I've had to handle a number of sites that are
essentially just what you have.

Here's a basic guide to start:
--I use libxml, simplejson

fetch the page

in the page, do a split, to get the exact json (string) that you want.
-you'll do two splits: the 1st gets rid of the extra pre-json stuff,
 the 2nd gets rid of the extra post-json stuff that you don't need
--at this point, you should have the json string you need, or you
should be pretty close..
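
A minimal sketch of the two-split step, in case it helps. The page text
and both markers below are made up for illustration; you'd have to look
at the real page source to find what actually surrounds the data.

```python
# Hypothetical page content; the real page's markers will differ.
page = '<script>var chart = {"rows": [["15:59:59", "A", 500]]};</script>'


def extract_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker."""
    # 1st split: drop everything before the json
    after_start = text.split(start_marker, 1)[1]
    # 2nd split: drop everything after the json
    return after_start.split(end_marker, 1)[0]


raw = extract_between(page, "var chart = ", ";</script>")
print(raw)
```

If a marker isn't in the page, the `[1]` lookup raises an IndexError,
which is a handy signal that the site changed its layout.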

-now, you might need to "pretty" up what you have, as py/json only
accepts keys/values in a certain format (json wants double quotes, not
single quotes, etc.)
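
Here's a rough sketch of one kind of cleanup, assuming the only problem
is single quotes. The fragment and its key names are invented; your data
may need different fixes.

```python
import json

# Hypothetical fragment left over after the splits; it uses single
# quotes, which the json module rejects.
raw = "{'time': '15:59:59', 'side': 'A', 'volume': 500}"

# Naive cleanup: swap single quotes for double quotes. This is only safe
# when no key or value contains an apostrophe of its own.
cleaned = raw.replace("'", '"')

record = json.loads(cleaned)
print(record["time"], record["volume"])
```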

once you've gotten this far, you might actually have the json string,
in which case, you can load it directly with the json module, and
proceed as you wish.
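
Something along these lines, using the standard json module (simplejson
has the same loads call). The "trades" key and the column order are
assumptions modelled on the sample rows in the question, not what the
site really returns.

```python
import json

# Hypothetical json string; the layout (time, side, volume, price,
# turnover) is an assumption based on the sample rows in the question.
raw = (
    '{"trades": [["15:59:59", "A", 500, 6.790, "3,395"],'
    ' ["15:59:53", "B", 500, 6.780, "3,390"]]}'
)

data = json.loads(raw)

# Turnover arrives as text with thousands separators, so strip the commas
# before converting it to a number.
turnovers = [int(row[4].replace(",", "")) for row in data["trades"]]

for row, turnover in zip(data["trades"], turnovers):
    print(row[0], row[1], row[2], row[3], turnover)
```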

you might also find that what you have is really a py dictionary, and
you can handle that as well!
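
For the py dictionary case, one option is ast.literal_eval from the
standard library. A small sketch, again with an invented fragment:

```python
import ast

# Hypothetical fragment: a Python-style literal rather than strict json
# (single quotes, None instead of null, a trailing comma).
raw = "{'code': '6881', 'rows': [('15:59:59', 'A', 500, 6.79),], 'note': None}"

# literal_eval only evaluates literals (dicts, lists, tuples, strings,
# numbers, None, ...), so it is much safer than eval() on text pulled
# from a web page.
data = ast.literal_eval(raw)
print(data["code"], data["rows"][0])
```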

Have fun, let us know if you have issues...



On Sun, Dec 13, 2015 at 2:44 AM, Crusier <crusier at gmail.com> wrote:
> Dear All,
>
> I am trying to scrape the following website; however, I have
> encountered some problems. As you can see, I am not really familiar
> with regex and I hope you can give me some pointers on how to solve
> this problem.
>
> I hope I can download all the transaction data into the database.
> However, I need to retrieve it first. The data which I hope to
> retrieve is as follows:
>
> "
> 15:59:59     A     500     6.790     3,395
> 15:59:53     B     500     6.780     3,390................
>
> Thank you
>
> Below is my code:
>
> from bs4 import BeautifulSoup
> import requests
> import re
>
> url = 'https://bochk.etnet.com.hk/content/bochkweb/eng/quote_transaction_daily_history.php?code=6881&time=F&timeFrom=090000&timeTo=160000&turnover=S&sessionId=44c99b61679e019666f0570db51ad932&volMin=0&turnoverMin=0'
>
> def turnover_detail(url):
>     response = requests.get(url)
>     html = response.content
>     soup = BeautifulSoup(html,"html.parser")
>     data = soup.find_all("script")
>     for json in data:
>         print(json)
>
> turnover_detail(url)
>
> Best Regards,
> Henry

