[Web-SIG] Support tools for analyzing pages on the Web
Christian Wyglendowski
christian at dowski.com
Sat Feb 3 04:41:38 CET 2007
On 2/2/07, Dave Kuhlman <dkuhlman at rexx.com> wrote:
> I'd like to implement and explore tools for analyzing Web pages. I
> have in mind things like:
>
> - Tracing links from a Web page. Building a tree structure of
> links to a specified depth.
>
> - Tracing links to a Web page. Showing incoming links to a
> specified depth.
>
> - Word count, word frequency analysis, words in context, etc.
>
> - Etc.
>
> Basically, I'm interested in looking at the structure of the Web
> and trying to help make it useful.
Sounds like an interesting project.
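For the link-tracing part, a rough standard-library sketch might look
like the following. In real use you'd fetch each URL with
urllib2.urlopen (urllib.request in newer Pythons) and parse with
BeautifulSoup; here a small PAGES dict stands in for the network so
the idea is self-contained, and the URLs in it are made up.

```python
# Sketch: build a tree of outgoing links down to a fixed depth.
# PAGES is a stand-in for fetching real pages over the network.
from html.parser import HTMLParser

PAGES = {  # hypothetical site: URL -> HTML body
    "/index": '<a href="/a">a</a> <a href="/b">b</a>',
    "/a": '<a href="/b">b</a>',
    "/b": '<a href="/index">home</a>',
}

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def links_in(url):
    """Return the outgoing links of one page (here: from PAGES)."""
    parser = LinkParser()
    parser.feed(PAGES.get(url, ""))
    return parser.links

def link_tree(url, depth):
    """Nested dict of links reachable from `url`, to `depth` levels."""
    if depth == 0:
        return {}
    return {child: link_tree(child, depth - 1) for child in links_in(url)}

tree = link_tree("/index", 2)
# {'/a': {'/b': {}}, '/b': {'/index': {}}}
```

For a real crawl you'd also want to track visited URLs so cycles
(like /b linking back to /index above) don't blow the depth budget.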
> So, my question: Are there existing tools (in Python, of course) for
> this kind of thing? I'd like (1) not to reinvent what is already
> there and (2) to make use of what already exists.
Well, for the analysis phase, I would look at the Natural Language
Toolkit (NLTK) [1]. I haven't used it personally, but I have always
wanted to try it out. The documentation is great.
> I've done a few Web searches, but have not found that much of
> interest.
>
> I plan to start with BeautifulSoup.py at a minimum.
Maybe urllib2.urlopen + BeautifulSoup + nltk will be enough to get you
going. Post back with any cool results.
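Until NLTK is wired in, plain collections.Counter covers the basic
word-count step. A minimal sketch (the crude regex tag-stripping is
just a placeholder for what BeautifulSoup would do properly):

```python
# Sketch: strip tags, tokenize, and count word frequencies.
# NLTK would add stemming, stopword lists, concordances, etc.
import re
from collections import Counter

def word_frequencies(html):
    text = re.sub(r"<[^>]+>", " ", html)       # crude tag stripping
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

freq = word_frequencies("<p>the web is the web</p>")
# freq["the"] == 2, freq["web"] == 2, freq["is"] == 1
```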
Christian