Parsing/Crawler Questions - solution

Fri Mar 6 19:19:10 EST 2009

So, it sounds like your update means that it is related to a specific
url.

I'm curious about this issue myself.  I've often wondered how one
could properly crawl an AJAX-ish site when you're not sure how quickly
the data will be returned after the page has been loaded.

John, your advice has really helped me.  Bruce / anyone else, have you
had any further experience with this type of parsing / crawling?

On Mar 5, 2:50 pm, "bruce" <bedoug... at earthlink.net> wrote:
> hi john...
>
> update...
>
> further investigation has revealed that apparently, for some urls/sites, the
> server serves up pages that take a while to be fetched... this appears to be
> a potential problem, in that it appears that the parse script never gets
> anything from the python mech/urllib read function.
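[A possible way to handle slow servers, sketched in Python 3 rather than the 2009-era urllib/mechanize mentioned above. The names fetch_once and fetch_with_retry, the 30-second timeout, the three attempts and the linear backoff are all assumptions made for illustration, not something taken from the thread. The idea is simply to bound each read with a timeout and to treat an empty read as a failure worth retrying.]

```python
import time
import urllib.request


def fetch_once(url, timeout=30.0):
    """Read one URL, giving up if the server stalls for `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_retry(url, fetch=fetch_once, attempts=3, delay=2.0):
    """Call fetch(url) up to `attempts` times (hypothetical helper).

    `fetch` is injectable so the retry logic can be exercised without
    touching the network.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            body = fetch(url)
            if body:
                return body
            # An empty read is treated as a failure, not as valid content.
            last_error = ValueError("empty response from %s" % url)
        except OSError as exc:
            # URLError and socket timeouts are both subclasses of OSError.
            last_error = exc
        if attempt < attempts - 1:
            time.sleep(delay * (attempt + 1))  # wait a little longer each time
    raise last_error
```

[Whether three attempts is enough depends on the site; when hitting the same server hundreds of times, a longer delay is also kinder to the web server.]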
>
> the curious issue is that i can run a single test script, pointing to the
> url, and after a bit of time.. the resulting content is fetched/downloaded
> correctly. by the way, i can get the same results in my test browsing
> environment, if i start it with only a subset of the urls that i've been
> using to test the app.
>
> hmm... might be a resource issue, a timing issue,.. or something else...
> hmmm...
>
> thanks
>
> again.... the problem i'm facing really has nothing to do with a specific
> url... the app i have for the usc site works...
>
> but for any number of reasons... you might get different results when
> running the app..
> -the server could be screwed up..
> -data might be cached
> -data might be changed, and not updated..
> -actual app problems...
> -networking issues...
> -memory corruption issues...
> -process constraint issues..
> -web server overload..
> -etc...
>
> the assumption that most people appear to make is that if you create a
> parser, and run and test it once.. then if it gets you the data, it's
> working.. when you run the same app.. 100s of times, and you're slamming the
> webserver... then you realize that that's a vastly different animal than
> simply running a single query a few times...
>
> so.. nope, i'm not running the app and getting data from a dynamic page that
> hasn't finished uploading/creating the content..
>
> but what my analysis is showing, not only for the usc, but for others as
> well.. is that there might be differences in what gets returned...
>
> which is where a smoothing algorithmic approach appears to be workable..
>
> i've been starting to test this approach, and it actually might have a
> chance of working...
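[The "smoothing" approach is not spelled out in the thread. One possible reading, sketched below in Python 3, is to crawl the same pages several times and keep only the items that show up in enough of the runs, while flagging the ones that come and go. The function names, the 0.5 threshold and the idea of representing each run as a set of item keys are assumptions for illustration only.]

```python
from collections import Counter


def smooth_runs(runs, min_fraction=0.5):
    """Return the items seen in at least `min_fraction` of the runs.

    `runs` is a list of iterables, one per crawl, holding hashable
    item keys (for example course identifiers).
    """
    run_sets = [set(run) for run in runs]
    if not run_sets:
        return set()
    counts = Counter(item for run in run_sets for item in run)
    threshold = min_fraction * len(run_sets)
    return {item for item, seen in counts.items() if seen >= threshold}


def unstable_items(runs):
    """Return the items that are missing from at least one run."""
    run_sets = [set(run) for run in runs]
    if not run_sets:
        return set()
    return set.union(*run_sets) - set.intersection(*run_sets)
```

[A non-empty result from unstable_items is a hint that something between the crawler and the server is dropping data, which is worth investigating before trusting any single run.]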
>
> so.. as i've stated a number of times.. focusing on a specific url isn't the
> issue.. the larger issue is how you can
> programmatically/algorithmically/automatically, be reasonably ensured that
> what you have is exactly what's on the site...
>
> ain't screen scraping fun!!!
>
> -----Original Message-----
> From: python-list-bounces+bedouglas=earthlink.... at python.org
>
> [mailto:python-list-bounces+bedouglas=earthlink.... at python.org]On Behalf
> Of John Nagle
> Sent: Thursday, March 05, 2009 10:54 AM
> To: python-l... at python.org
> Subject: Re: Parsing/Crawler Questions - solution
>
> Philip Semanchuk wrote:
> > On Mar 5, 2009, at 12:31 PM, bruce wrote:
>
> >> hi..
>
> >> the url i'm focusing on is irrelevant to the issue i'm trying to solve at
> >> this time.
>
> > Not if we're to understand the situation you're trying to describe. From
> > what I can tell, you're saying that the target site displays different
> > results each time your crawler visits it. It's as if e.g. the site knows
> > about 100 courses but only displays 80 randomly chosen ones to each
> > visitor. If that's the case, then it is truly bizarre.
>
>      Agreed.  The course list isn't changing that rapidly.
>
>      I suspect the original poster is doing something like reading the DOM
> of a dynamic page while the page is still updating, running a browser
> in a subprocess.  Is that right?
>
>      I've had to deal with that in Javascript.  My AdRater browser plug-in
> (http://www.sitetruth.com/downloads) looks at Google-served ads and
> rates the advertisers.   There, I have to watch for page-change events
> and update the annotations I'm adding to ads.
>
>      But you don't need to work that hard here. The USC site is actually
> querying a server which provides the requested data in JSON format.  See
>
>        http://web-app.usc.edu/soc/dev/scripts/soc.js
>
> Reverse-engineer that and you'll be able to get the underlying data.
> (It's an amusing script; many little fixes to data items are performed,
> something that should have been done at the database front end.)
>
> The way to get USC class data is this:
>
> 1.  Start here: "http://web-app.usc.edu/soc/term_20091.html"
> 2.  Examine all the department pages under that page.
> 3.  On each page, look for the value of "coursesrc", like this:
>         var coursesrc = '/ws/soc/api/classes/aest/20091'
> 4.  For each "coursesrc" value found, construct a URL like this:
>        http://web-app.usc.edu/ws/soc/api/classes/aest/20091
> 5.  Read that URL.  This will return the department's course list in
>      JSON format.
> 6.  From the JSON tree, pull out CourseData items, which look like this:
>
> "CourseData":
> {"prefix":"AEST",
> "number":"220",
> "sequence":"B",
> "suffix":{},
> "title":"Advanced Leadership Laboratory II",
> "description":"Additional exposure to the military experience for continuing AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and the environment of an Air Force officer. Credit\/No Credit.",
> "units":"1",
> "restriction_by_major":{},
> "restriction_by_class":{},
> "restriction_by_school":{},
> "CourseNotes":{},
> "CourseTermNotes":{},
> "prereq_text":"AEST-220A",
> "coreq_text":{},
> "SectionData":{"id":"41799",
> "session":"790",
> "dclass_code":"D",
> "title":"Advanced Leadership Laboratory II",
> "section_title":{},
> "description":{},
> "notes":{},
> "type":"Lec",
> "units":"1",
> "spaces_available":"30",
> "number_registered":"2",
> "wait_qty":"0",
> "canceled":"N",
> "blackboard":"Y",
> "comment":{},
> "day":{},"start_time":"TBA",
> "end_time":"TBA",
> "location":"OFFICE",
> "instructor":{"last_name":"Hampton","first_name":"Daniel"},
> "syllabus":{"format":{},"filesize":{}},
> "IsDistanceLearning":"N"}}},
>
> Parsing the JSON is left as an exercise for the student.  (There's
> a Python module for that.)
>
> And no, the data isn't changing; you can read those pages of JSON over and
> over and get the same data every time.
>
>                                         John Nagle
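[A minimal Python 3 sketch of steps 3 through 6 above. The helper names, the regular expression and the assumption that "CourseData" may hold either a single object or a list are mine, not from the USC script; the fetching itself is left out so that the parsing can be tried on saved pages.]

```python
import json
import re
from urllib.parse import urljoin

TERM_PAGE = "http://web-app.usc.edu/soc/term_20091.html"

# Matches lines such as: var coursesrc = '/ws/soc/api/classes/aest/20091'
COURSESRC_RE = re.compile(r"""var\s+coursesrc\s*=\s*['"]([^'"]+)['"]""")


def find_coursesrc(page_text):
    """Step 3: return the coursesrc path found in a department page, or None."""
    match = COURSESRC_RE.search(page_text)
    return match.group(1) if match else None


def course_url(coursesrc, base=TERM_PAGE):
    """Step 4: turn the site-relative path into a full URL."""
    return urljoin(base, coursesrc)


def iter_course_data(node):
    """Step 6: walk a parsed JSON tree and yield each CourseData item."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "CourseData":
                # Assumption: one course is an object, several are a list.
                if isinstance(value, list):
                    for item in value:
                        yield item
                else:
                    yield value
            else:
                yield from iter_course_data(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_course_data(item)


def parse_department(json_text):
    """Steps 5 and 6: parse one department's JSON and list its courses."""
    return list(iter_course_data(json.loads(json_text)))
```

[Note that empty fields arrive as {} rather than as empty strings in the sample above, so code that reads "suffix" or "coreq_text" should expect either a string or an empty dict.]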