trying to begin a code for web scraping

Tue Feb 19 08:31:01 EST 2019

On Tue, Feb 19, 2019 at 12:52 AM Drake Gossi <drake.gossi at gmail.com> wrote:
>
> Hi everyone,
>
> I'm trying to write code to scrape this website
> <https://www.regulations.gov/document?D=ED-2018-OCR-0064-5403> (
> regulations.gov) of its comments, but I'm having trouble figuring out what
> to link onto in the inspect page (like when I right click on inspect with
> the mouse).
>
> Although I need to write code to scrape all 11,000ish of the comments
> related to this event (by putting the code in a loop?), I'm still at the
> stage of looking at individual comments. So, for example, with this comment
> <https://www.regulations.gov/document?D=ED-2018-OCR-0064-5403>, I know
> enough to right click on inspect and to look at the xml? (This is how much
> of a beginner I am--what am I looking at when I right click inspect?) Then,
> I control F to find where the comment is in the code. For that comment, the
> word I used control F on was "troubling." So, I found the comment buried in
> the xml.
>
> But my issue is this. I don't know what to link onto to scrape the comment
> (and I assume that this same sequence of letters would apply to scraping
> all of the comments in general). I assume what I grab is GIY1LSJISD. I'm
> watching this video, and the person is linking onto "tr" and "td," but mine
> is not that easy. In other words, what is the most essential language (bit
> of xml? code), the copying of which would allow me to extract not only this
> comment, but all of the comments, were I to put this bit of language (/xml?)
> into my code? ... ... soup.findAll('?')
>
> In sum, what I need to know is, how do I tell my Python code to ignore all
> of the surrounding code and go straight in and grab the comment. Of course,
> I need to grab other things too like the name, category, date, and so on,
> but I haven't gotten that far yet. Right now, I'm just trying to figure out
> what I need to insert into my code so that I can get the comment.
>
> Help! I'm trying to learn code on the fly. I'm an experienced researcher
> but am new to coding. Any help you could give me would be tremendously
> awesome.
>
> Best,
> Drake

Beautiful Soup is your friend here.  It can parse the HTML of a page
you have downloaded and pull out the data inside its tags.  But
'modern' web pages often use JavaScript, so what you download is not
just HTML, but JavaScript that changes the HTML after the page loads.
For this you need another tool -- I think one is called Scrapy.
Others here probably have experience with that.
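
As a rough illustration of the Beautiful Soup part only, here is a
minimal sketch.  It assumes the GIY1LSJISD string from your message is
a CSS class on the <div> that wraps each comment; that is a guess, and
the sample HTML below is made up rather than copied from
regulations.gov.  The function name extract_comments is hypothetical
too.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a downloaded page; the real markup will differ.
SAMPLE_HTML = """
<html><body>
  <div class="header">Comment from a member of the public</div>
  <div class="GIY1LSJISD">I find this proposal troubling.</div>
  <div class="GIY1LSJISD">A second comment.</div>
</body></html>
"""


def extract_comments(html, css_class="GIY1LSJISD"):
    """Return the text of every <div> carrying the given class.

    The default class name is an assumption taken from the question,
    not something confirmed against the live site.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns every matching tag; class_ filters on the CSS class.
    matches = soup.find_all("div", class_=css_class)
    return [div.get_text(strip=True) for div in matches]


comments = extract_comments(SAMPLE_HTML)
for text in comments:
    print(text)
```

If the same call comes back empty on the HTML you actually download,
that would suggest the comment is filled in by JavaScript after the
page loads, which is the case described above.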

Show a small snippet of your code that demonstrates at least one of
your coding problems.
-- 
Joel Goldstick
http://joelgoldstick.com/blog
http://cc-baseballstats.info/stats/birthdays