[Tutor] Beautiful Soup

Tue Sep 29 18:51:20 CEST 2015

Crusier wrote:

> I have recently finished reading "Starting out with Python" and I
> really want to do some web scraping. Please kindly advise where I can
> get more information about BeautifulSoup. It seems that Documentation
> is too hard for me.

If you tell us what you don't understand and what you want to achieve, we
may be able to help you.

> from bs4 import BeautifulSoup
> import urllib.request
> 
> HKFile = urllib.request.urlopen("https://bochk.etnet.com.hk/content/bochkweb/tc/quote_transaction_daily_history.php?code=2388")
> HKHtml = HKFile.read() 
> HKFile.close()
> 
> print(HKFile)

> Furthermore, I have tried to scrap this site but it seems that there
> is an error (<http.client.HTTPResponse object at 0x02C09F90>). 

That's not an error, that's what urlopen() returns: print() just shows the
repr of the response object. If an error occurs, Python libraries are
usually explicit and raise an exception. If the exception is not handled by
your script, Python by default prints a traceback and exits. For example:

>>> import urllib.request
>>> urllib.request.urlopen("http://httpbin.org/status/404")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: NOT FOUND

That's what a well-behaved error looks like ;)
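
If you would rather have your script report such a failure than exit with a
traceback, you can catch the exception yourself. The following is only a
sketch: fetch() is a helper name I made up for illustration, and printing a
message and returning None is just one possible way to react.

```python
import urllib.error
import urllib.request


def fetch(url):
    """Return the response body as bytes, or None if the request fails.

    fetch() is a hypothetical helper, not part of urllib.
    """
    try:
        # The with-statement closes the response even if read() fails.
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", err.code, err.reason)
    except urllib.error.URLError as err:
        # No usable answer at all (unknown host, refused connection, ...).
        print("Could not open URL:", err.reason)
    return None
```

Calling fetch("http://httpbin.org/status/404") should then print a one-line
message instead of the traceback above, assuming the site still answers
that URL with a 404.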

> Please advise what I should do in order to overcome this.

If you want to print the contents of the page, just replace the line

> print(HKFile)

in your code with

print(HKHtml.decode("utf-8"))
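
Note that decode("utf-8") only works if the page really is encoded as
UTF-8, which I have not checked for this site. A slightly more defensive
sketch asks the response which charset the server declared and falls back
to UTF-8 only when none is given. read_text() and its fallback parameter
are names I invented for this example.

```python
import urllib.request


def read_text(url, fallback="utf-8"):
    """Fetch url and return its body as str.

    read_text() is a hypothetical helper. It trusts the charset from the
    Content-Type header, which a server may omit or get wrong.
    """
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # get_content_charset() returns None if no charset was declared.
        charset = response.headers.get_content_charset() or fallback
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw.decode(charset, errors="replace")
```

With that you could write print(read_text(url)) and pass the same string to
BeautifulSoup later on.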