[Web-SIG] Support tools for analyzing pages on the Web

Christian Wyglendowski christian at dowski.com
Sat Feb 3 04:41:38 CET 2007


On 2/2/07, Dave Kuhlman <dkuhlman at rexx.com> wrote:
> I'd like to implement and explore tools for analyzing Web pages.  I
> have in mind things like:
>
> - Tracing links from a Web page.  Building a tree structure of
>   links to a specified depth.
>
> - Tracing links to a Web page.  Showing incoming links to a
>   specified depth.
>
> - Word count, word frequency analysis, words in context, etc.
>
> - Etc.
>
> Basically, I'm interested in looking at the structure of the Web
> and trying to help make it useful.

Sounds like an interesting project.

> So, my question: are there existing tools (in Python, of course) for
> this kind of thing?  I'd like (1) not to reinvent what is already
> there and (2) to make use of what already exists.

Well, for your analysis phase, I would look at the Natural Language
Toolkit (NLTK) [1].  I haven't used it personally, but I have always
wanted to try it out, and the documentation is great.
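
For the word-frequency part, NLTK's FreqDist should do most of the
work.  Something along these lines might be a starting point (a rough,
untested sketch; it assumes a reasonably recent NLTK and BeautifulSoup
3, and just splits on whitespace rather than doing real tokenization):

    from BeautifulSoup import BeautifulSoup
    from nltk import FreqDist

    def word_frequencies(html):
        # Pull the visible text out of the markup, then count tokens.
        text = ' '.join(BeautifulSoup(html).findAll(text=True))
        freq = FreqDist(word.lower() for word in text.split())
        # Most frequent words first.
        return sorted(freq.items(), key=lambda pair: pair[1], reverse=True)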

> I've done a few Web searches, but have not found that much of
> interest.
>
> I plan to start with BeautifulSoup.py at a minimum.

Maybe urllib2.urlopen + BeautifulSoup + nltk will be enough to get you
going.  Post back with any cool results.
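
For the link-tracing part, a first cut could look something like the
following (again an untested sketch, assuming BeautifulSoup 3; it only
follows absolute http links and doesn't track pages it has already
visited, so it will happily refetch duplicates):

    import urllib2
    from urlparse import urljoin
    from BeautifulSoup import BeautifulSoup

    def link_tree(url, depth):
        # Build a nested dict of links reachable from url, to the given depth.
        if depth == 0:
            return {}
        try:
            page = urllib2.urlopen(url).read()
        except (urllib2.URLError, ValueError):
            return {}
        soup = BeautifulSoup(page)
        tree = {}
        for anchor in soup.findAll('a', href=True):
            child = urljoin(url, anchor['href'])
            if child.startswith('http'):
                tree[child] = link_tree(child, depth - 1)
        return tree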

Christian
