Parsing/Crawler Questions..

Thu Mar 5 12:31:23 EST 2009

hi..

the url i'm focusing on is irrelevant to the issue i'm trying to solve at
this time.

i think an approach will be to fire up a number of parsing attempts, and to
track the returned depts/classes/etc... in theory (hopefully) i should be
able to create a process to build a kind of statistical representation of
what the site looks like (names of depts, names/number of classes for given
depts, etc..) if i'm correct, this would provide a complete
"list/understanding" of what the courselist looks like....
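a minimal sketch of that bookkeeping, assuming each parsing attempt yields a list of (dept, course) pairs. the function name and the sample data are made up for illustration, not taken from any real site or parser:

```python
from collections import Counter

def build_inventory(runs):
    """Map each (dept, course) pair to the fraction of runs that returned it."""
    seen = Counter()
    for run in runs:
        # use a set so a pair repeated within one run is counted only once
        seen.update(set(run))
    total = len(runs)
    return {pair: count / total for pair, count in seen.items()}

# hypothetical output of three parsing attempts against the same course list
runs = [
    [("MATH", "101"), ("MATH", "201"), ("PHYS", "110")],
    [("MATH", "101"), ("PHYS", "110")],  # this attempt missed one course
    [("MATH", "101"), ("MATH", "201"), ("PHYS", "110")],
]
inventory = build_inventory(runs)
```

pairs with a fraction well below 1.0 are the ones worth a second look: either a parsing attempt missed them, or the site changed between runs.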

i could then run the parsing process a number of times, examining the actual
values/results for each query, and keeping the values that show up most
often for that query.. the idea being that the app will return correct
results for most of the queries, most of the time.. so on a statistical
basis.. i can take the results that are returned with the highest frequency...
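the "take the most frequent result" step could look roughly like this. it is only a sketch: the function name and the sample numbers are hypothetical, results are assumed to be hashable (a count, a string, a tuple of course names), and a tie between two values is not resolved in any meaningful way:

```python
from collections import Counter

def most_frequent_result(results):
    """Return (value, share) for the most common result across repeated runs."""
    if not results:
        raise ValueError("need at least one result")
    counts = Counter(results)
    value, hits = counts.most_common(1)[0]
    return value, hits / len(results)

# hypothetical: number of classes returned for one dept query over five runs
class_counts = [42, 42, 40, 42, 42]
value, share = most_frequent_result(class_counts)
```

the share gives a rough confidence figure; a low share for a query suggests the parser is unstable for that page, or that the page itself is changing.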

so this approach might work. but again, haven't seen anything in the
literature/'net that talks about this...

thoughts...

thanks

-----Original Message-----
From: python-list-bounces+bedouglas=earthlink.net at python.org
[mailto:python-list-bounces+bedouglas=earthlink.net at python.org]On Behalf
Of John Nagle
Sent: Thursday, March 05, 2009 8:38 AM
To: python-list at python.org
Subject: Re: Parsing/Crawler Questions..

bruce wrote:
> hi john..
>
> You're missing the issue, so a little clarification...
>
> I've got a number of test parsers that point to a given classlist site.. the
> scripts work.
>
> the issue that one faces is that you never "know" if you've gotten all of
> the items/links that you're looking for based on the XPath functions. This
> could be due to an error in the parsing, or it could be due to an admin
> changing the site (removing/adding courses etc...)

    What URLs are you looking at?

					John Nagle