[Tutor] Beautiful Soup
Peter Otten
__peter__ at web.de
Tue Sep 29 18:51:20 CEST 2015
Crusier wrote:
> I have recently finished reading "Starting out with Python" and I
> really want to do some web scraping. Please kindly advise where I can
> get more information about BeautifulSoup. It seems that Documentation
> is too hard for me.
If you tell us what you don't understand and what you want to achieve, we may
be able to help you.
> from bs4 import BeautifulSoup
> import urllib.request
>
> HKFile = urllib.request.urlopen("https://bochk.etnet.com.hk/content/bochkweb/tc/quote_transaction_daily_history.php?code=2388")
> HKHtml = HKFile.read()
> HKFile.close()
>
> print(HKFile)
> Furthermore, I have tried to scrap this site but it seems that there
> is an error (<http.client.HTTPResponse object at 0x02C09F90>).
That's not an error; that's what urlopen() returns. When an error occurs,
Python libraries are usually explicit and raise an exception. If your script
does not handle the exception, Python prints a traceback and exits. For
example:
>>> import urllib.request
>>> urllib.request.urlopen("http://httpbin.org/status/404")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: NOT FOUND
That's what a well-behaved error looks like ;)
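If you would rather have your script deal with such an error than exit with a
traceback, you can catch the exception. Here is a minimal sketch; the function
name fetch() and its opener parameter are made up for illustration (the
parameter is only there so the function can be tried without a network
connection), while urllib.error.HTTPError and its code and reason attributes
are part of the standard library:

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the server reports an error.

    `opener` is a hypothetical hook for illustration; by default the real
    urlopen() is used.
    """
    try:
        # The response object is a context manager and is closed on exit.
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # err.code is the HTTP status (e.g. 404), err.reason the message.
        print("Server returned an error:", err.code, err.reason)
        return None
```

Calling fetch("http://httpbin.org/status/404") should then print a one-line
message and return None instead of ending the script.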
> Please advise what I should do in order to overcome this.
If you want to print the contents of the page just replace the line
> print(HKFile)
in your code with
print(HKHtml.decode("utf-8"))
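Note that "utf-8" is a guess on my part; I have not checked which encoding the
page actually uses. A slightly more careful sketch asks the response headers
for the charset and falls back to a default when none is given. The helper
name decode_body() and its default parameter are invented for this example;
get_content_charset() is a method of the standard library's message objects,
which is also what the headers attribute of the urlopen() response is:

```python
def decode_body(body, headers, default="utf-8"):
    """Decode bytes using the charset named in the Content-Type header.

    Falls back to `default` (an assumption, not something the server
    promised) when the header does not name a charset.
    """
    charset = headers.get_content_charset() or default
    # errors="replace" substitutes a marker character for undecodable bytes
    # instead of raising UnicodeDecodeError.
    return body.decode(charset, errors="replace")
```

In your script that would be print(decode_body(HKHtml, HKFile.headers)). The
resulting string can also be passed on to BeautifulSoup for parsing.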