Parsing/Crawler Questions - solution

John Nagle nagle at animats.com
Thu Mar 5 13:54:07 EST 2009


Philip Semanchuk wrote:
> On Mar 5, 2009, at 12:31 PM, bruce wrote:
> 
>> hi..
>>
>> the url i'm focusing on is irrelevant to the issue i'm trying to solve at
>> this time.
> 
> Not if we're to understand the situation you're trying to describe. From 
> what I can tell, you're saying that the target site displays different 
> results each time your crawler visits it. It's as if e.g. the site knows 
> about 100 courses but only displays 80 randomly chosen ones to each 
> visitor. If that's the case, then it is truly bizarre.

     Agreed.  The course list isn't changing that rapidly.

     I suspect the original poster is doing something like reading the DOM
of a dynamic page while the page is still updating, running a browser
in a subprocess.  Is that right?

     I've had to deal with that in JavaScript.  My AdRater browser plug-in
(http://www.sitetruth.com/downloads) looks at Google-served ads and
rates the advertisers.  There, I have to watch for page-change events
and update the annotations I'm adding to ads.

     But you don't need to work that hard here. The USC site is actually
querying a server which provides the requested data in JSON format.  See

	http://web-app.usc.edu/soc/dev/scripts/soc.js

Reverse-engineer that and you'll be able to get the underlying data.
(It's an amusing script; many little fixes to data items are performed,
something that should have been done at the database front end.)

The way to get USC class data is this:

1.  Start here: "http://web-app.usc.edu/soc/term_20091.html"
2.  Examine all the department pages under that page.
3.  On each page, look for the value of "coursesrc", like this:
	var coursesrc = '/ws/soc/api/classes/aest/20091'
4.  For each "coursesrc" value found, construct a URL like this:
	http://web-app.usc.edu/ws/soc/api/classes/aest/20091
5.  Read that URL.  This will return the department's course list in
     JSON format.
6.  From the JSON tree, pull out CourseData items, which look like this:

"CourseData":
{"prefix":"AEST",
"number":"220",
"sequence":"B",
"suffix":{},
"title":"Advanced Leadership Laboratory II",
"description":"Additional exposure to the military experience for continuing 
AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and the 
environment of an Air Force officer. Credit\/No Credit.",
"units":"1",
"restriction_by_major":{},
"restriction_by_class":{},
"restriction_by_school":{},
"CourseNotes":{},
"CourseTermNotes":{},
"prereq_text":"AEST-220A",
"coreq_text":{},
"SectionData":{"id":"41799",
"session":"790",
"dclass_code":"D",
"title":"Advanced Leadership Laboratory II",
"section_title":{},
"description":{},
"notes":{},
"type":"Lec",
"units":"1",
"spaces_available":"30",
"number_registered":"2",
"wait_qty":"0",
"canceled":"N",
"blackboard":"Y",
"comment":{},
"day":{},"start_time":"TBA",
"end_time":"TBA",
"location":"OFFICE",
"instructor":{"last_name":"Hampton","first_name":"Daniel"},
"syllabus":{"format":{},"filesize":{}},
"IsDistanceLearning":"N"}}},

Parsing the JSON is left as an exercise for the student.  (There's
a Python module for that: "json", in the standard library since
Python 2.6.)
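
The steps above can be sketched roughly as follows, written for current
Python 3. This is only an illustration under stated assumptions, not the
site's documented interface: the function names, the UTF-8 decoding, and the
"hypothetical_wrapper" key in the sample reply are made up for the example,
and the real top-level layout of the JSON may differ. The demonstration at
the bottom runs offline against text taken from the post.

```python
import json
import re
from urllib.parse import urljoin
from urllib.request import urlopen

TERM_PAGE = "http://web-app.usc.edu/soc/term_20091.html"

# Matches a line such as:  var coursesrc = '/ws/soc/api/classes/aest/20091'
COURSESRC_RE = re.compile(r"""var\s+coursesrc\s*=\s*['"]([^'"]+)['"]""")


def find_coursesrc(page_text):
    """Step 3: return the coursesrc path in a department page, or None."""
    match = COURSESRC_RE.search(page_text)
    return match.group(1) if match else None


def course_data_url(coursesrc):
    """Step 4: resolve the coursesrc path against the site's host."""
    return urljoin(TERM_PAGE, coursesrc)


def fetch_json(url):
    """Step 5: read the URL and decode the reply (needs network access).

    Assumption: the server replies in UTF-8.
    """
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_course_data(node):
    """Step 6: yield every dict stored under a "CourseData" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "CourseData":
                # Assumption: one course may arrive as a single dict and
                # several courses as a list of dicts.
                items = value if isinstance(value, list) else [value]
                for item in items:
                    if isinstance(item, dict):
                        yield item
            else:
                yield from iter_course_data(value)
    elif isinstance(node, list):
        for child in node:
            yield from iter_course_data(child)


def text_or_none(value):
    """The sample uses {} for empty fields; map anything non-string to None."""
    return value if isinstance(value, str) else None


# Offline demonstration; fetch_json() is defined but not called here.
page = "var coursesrc = '/ws/soc/api/classes/aest/20091'"
url = course_data_url(find_coursesrc(page))
print(url)

# "hypothetical_wrapper" is invented; the fields are a subset of the sample.
reply = json.loads("""
{"hypothetical_wrapper": {"CourseData": {
    "prefix": "AEST", "number": "220", "sequence": "B", "suffix": {},
    "title": "Advanced Leadership Laboratory II", "units": "1",
    "SectionData": {"id": "41799", "type": "Lec", "spaces_available": "30"}
}}}
""")
for course in iter_course_data(reply):
    print(course["prefix"], course["number"],
          text_or_none(course["sequence"]), course["title"])
```

Searching for "CourseData" at any depth avoids depending on wrapper keys that
the post does not show, at the cost of being less strict than a parser written
against the real layout.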

And no, the data isn't changing; you can read those pages of JSON over and
over and get the same data every time.
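
One way to check that claim for yourself is to hash each reply and compare
digests between visits. The sketch below is an illustration only: the
function name is made up, and the three strings stand in for replies you
would actually download. The JSON is re-serialized with sorted keys first, so
a mere change in key order does not count as a change in the data.

```python
import hashlib
import json


def payload_digest(raw_json):
    """Hash a JSON document in canonical form (sorted keys, fixed separators)."""
    canonical = json.dumps(json.loads(raw_json), sort_keys=True,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Stand-ins for replies read on separate visits.
first = '{"prefix": "AEST", "number": "220"}'
second = '{"number": "220", "prefix": "AEST"}'   # same data, other key order
changed = '{"prefix": "AEST", "number": "221"}'  # one value differs

print(payload_digest(first) == payload_digest(second))
print(payload_digest(first) == payload_digest(changed))
```

If the digests for a department differ between two crawls, the difference is
in the data the server sent, not in how the page was rendered.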

					John Nagle


