Error getting data from website

Peter Otten __peter__ at web.de
Sat Dec 7 05:53:59 EST 2019


Michael Torrie wrote:

> On 12/6/19 5:31 PM, DL Neil via Python-list wrote:
>> If you read the HTML data that the REPL has happily splattered all over
>> your terminal's screen (scroll back) (NB "soup" is easier to read than
>> is "content"!) you will observe that what you saw in your web-browser is
>> not what Amazon served in response to the Python "requests.get()"!
> 
> Sadly it's likely that Amazon's page is largely built from javascript.

That's not the problem here. Quoting the HTML returned by

requests.get("https://www.amazon.ca/dp/B07RZFQ6HC")

"""
To discuss automated access to Amazon data please contact
api-services-support at amazon.com.
"""
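
A quick way to notice this from a script is to look for that notice in the
response body before trying to parse anything. The following is only a
sketch: the function name looks_blocked and the constant BLOCK_MARKER are
made up for illustration, and the marker text is simply the sentence quoted
above, which Amazon may change at any time.

```python
# Hypothetical helper: the marker is copied from the notice quoted above
# and is an assumption about what a refused response contains.
BLOCK_MARKER = "To discuss automated access to Amazon data"


def looks_blocked(html: str) -> bool:
    """Return True if the page text contains the automated-access notice."""
    # Collapse all runs of whitespace so that line wrapping inside the
    # HTML cannot split the marker and hide it from the substring test.
    normalized = " ".join(html.split())
    return BLOCK_MARKER in normalized
```

You would call it on the text of the response (for example the .text
attribute of what requests.get() returns) and stop early if it returns True.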

If you retrieve the page manually:

$ wget "https://www.amazon.ca/dp/B07RZFQ6HC" -O tmp.gz
[...]
2019-12-07 11:47:03 (80.6 KB/s) - 'tmp.gz' saved [115426]

$ gunzip tmp.gz
$ python3
[...]
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("tmp").read())
>>> soup.find("span", dict(id="priceblock_dealprice"))
<span class="a-size-medium a-color-price priceBlockDealPriceString" id="priceblock_dealprice">CDN$ 1,019.00</span>
>>> _.text
'CDN$\xa01,019.00'
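
Note the \xa0 in that string: it is a non-breaking space between the currency
code and the amount. If you want a number rather than a string, something
along these lines might do. It is a sketch under assumptions: parse_price is
a name invented here, and it assumes the page's formatting stays as shown,
with a comma as thousands separator and a dot as decimal point.

```python
import re
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Extract the first number from a price string like 'CDN$\\xa01,019.00'."""
    # Assumption: ',' groups thousands and '.' marks the decimals, as in
    # the output shown above. Other locales would need different handling.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    # Drop the thousands separators before converting.
    return Decimal(match.group().replace(",", ""))


price = parse_price("CDN$\xa01,019.00")
```

Decimal is used instead of float so that the amount is kept exactly as
written on the page.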

> So scraping static html is probably not going to get you where you want
> to go.  

... because Amazon doesn't like what you are doing. You can either cheat or
play by their rules and use the API.

> There are heavier tools, such as Selenium that uses a real
> browser to grab a page, and the result of that you can parse and search
> perhaps.




