Error getting data from website

Chris Angelico rosuav at gmail.com
Fri Dec 6 21:28:26 EST 2019


On Sat, Dec 7, 2019 at 1:21 PM DL Neil via Python-list
<python-list at python.org> wrote:
>
> On 7/12/19 1:51 PM, Chris Angelico wrote:
> > On Sat, Dec 7, 2019 at 11:46 AM Michael Torrie <torriem at gmail.com> wrote:
> >>
> >> On 12/6/19 5:31 PM, DL Neil via Python-list wrote:
> >>> If you read the HTML data that the REPL has happily splattered all over
> >>> your terminal's screen (scroll back) (NB "soup" is easier to read than
> >>> is "content"!) you will observe that what you saw in your web-browser is
> >>> not what Amazon served in response to the Python "requests.get()"!
> >>
> >> Sadly it's likely that Amazon's page is largely built from javascript.
> >> So scraping static html is probably not going to get you where you want
> >> to go.  There are heavier tools, such as Selenium that uses a real
> >> browser to grab a page, and the result of that you can parse and search
> >> perhaps.
> >
> > Or look for an API instead.
>
>
> Both +1
> However, Selenium is possibly less-manageable for a 'beginner'.
> (NB my poorly-based assumption of OP)
>
> Amazon's HTML-response actually says this/these, but I left it open as a
> (learning) exercise for the OP. They likely prefer the API approach,
> because it can be measured...
>

Yes, and because it's way WAY easier to guarantee API stability than
Selenium-based page parseability.

But even when there's no *actual* API, you can sometimes delve into
the page and find the useful content directly, perhaps as a big blob
of JSON inside a <script> tag. There are no guarantees, of course
(there aren't any with parsing the HTML either), but it'll be way
easier to parse.
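
Here's a rough sketch of what that can look like, using only the
standard library. Everything specific in it is an assumption for
illustration: the sample page, the type="application/json" attribute,
and the names ScriptJSONFinder and extract_json_blobs are all made up.
A real site may embed its data differently (for instance as a
JavaScript assignment inside a plain <script>), in which case you'd
have to slice the JSON out of the script text yourself.

```python
import json
from html.parser import HTMLParser


class ScriptJSONFinder(HTMLParser):
    """Collect the text of every <script type="application/json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        # Assumption: the page marks its data blob with this type attribute.
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so accumulate it.
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            self.blobs.append("".join(self._chunks))


def extract_json_blobs(html):
    """Return the decoded contents of all JSON script tags in the page."""
    finder = ScriptJSONFinder()
    finder.feed(html)
    finder.close()
    results = []
    for blob in finder.blobs:
        try:
            results.append(json.loads(blob))
        except json.JSONDecodeError:
            # No guarantees: skip anything that is not valid JSON.
            continue
    return results


# Hypothetical page, standing in for whatever the server really sends.
SAMPLE_HTML = """
<html><body>
<div id="app"></div>
<script type="application/json" id="page-data">
{"title": "Example Widget", "price": 19.99, "in_stock": true}
</script>
<script>console.log("not JSON");</script>
</body></html>
"""

if __name__ == "__main__":
    for data in extract_json_blobs(SAMPLE_HTML):
        print(data["title"], data["price"])
```

With a real page you'd pass in the text of the response you got from
requests.get() instead of SAMPLE_HTML, after checking (by reading the
HTML, as suggested above) that such a blob is actually there.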

ChrisA

