Word frequencies -- Python or Perl for performance?

Nick Arnett narnett at mccmedia.com
Wed Mar 20 12:36:21 EST 2002


> -----Original Message-----
> From: python-list-admin at python.org
> [mailto:python-list-admin at python.org]On Behalf Of Bengt Richter

[snip]

> Others seem to have inferred what you are up to, but I'll have to stick with
> my original post until you say what you really are trying to do.

All is not clear to me yet...!  But this is related to identifying
relatively distinct ideas in on-line discussions.  I'm generating feature
vectors to spot when a conversation forks, so to speak.  The message
meta-data and bodies are stored in MySQL, so it would be somewhat of a
natural for me to stick word frequency data in a table.
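
A minimal sketch of what a word-frequency table could look like, for
illustration only.  It uses Python's built-in sqlite3 module as a stand-in for
MySQL so that it runs anywhere; the table name word_freq, its columns, and the
helper names store_word_counts and global_counts are all hypothetical, and the
tokenizer is deliberately crude.

```python
import re
import sqlite3
from collections import Counter


def store_word_counts(conn, messages):
    """Store per-message word counts; messages is (message_id, body) pairs."""
    # Hypothetical schema: one row per (message, word) with its count.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_freq ("
        " message_id INTEGER, word TEXT, n INTEGER,"
        " PRIMARY KEY (message_id, word))"
    )
    for message_id, body in messages:
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        counts = Counter(re.findall(r"[a-z']+", body.lower()))
        conn.executemany(
            "INSERT OR REPLACE INTO word_freq VALUES (?, ?, ?)",
            [(message_id, word, n) for word, n in counts.items()],
        )
    conn.commit()


def global_counts(conn):
    """Total count of each word over all stored messages."""
    rows = conn.execute("SELECT word, SUM(n) FROM word_freq GROUP BY word")
    return dict(rows.fetchall())
```

Keeping one row per message and word, rather than only global totals, leaves
room to join against the message meta-data (dates, threads) later.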

I just realized that the version of MySQL I'm using has full-text indexing
(I had thought it wasn't showing up until 4.0, which is not ready for
production use yet).  However, there are some default behaviors in MySQL's
full-text search that I don't like... and I'm not inclined at the moment to
compile my own MySQL binary to change them, especially since I still won't
get proximity and other capabilities.

I suppose I should have asked about full-text search packages that will
accomplish the same things, but I know too darn much about them and I want
to get at the inverted indexes themselves.  There's some unusual search
weighting that I want to be able to do eventually.  If there's an open
source search engine that would let me add an externally calculated
weighting factor, that would be terrific.
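
To make the idea concrete, here is a rough sketch of an inverted index that
keeps word positions (the raw material for proximity searches) and accepts an
externally calculated per-document weight at query time.  The names
build_index, search, and weights are made up for illustration, and the scoring
(term count times weight) is an assumption, not any particular engine's method.

```python
import re
from collections import defaultdict


def build_index(docs):
    """docs maps doc_id -> text. Returns {word: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z']+", text.lower())):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def search(index, query, weights=None):
    """Rank doc ids by summed term counts times an external weight.

    weights maps doc_id -> factor computed elsewhere; missing ids get 1.0.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for word in re.findall(r"[a-z']+", query.lower()):
        for doc_id, positions in index.get(word, {}).items():
            scores[doc_id] += len(positions) * weights.get(doc_id, 1.0)
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))
```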

> It's not even clear that you're interested in anything but global word
> frequencies over your entire set of files, except for the clue that you
> think you may need to store something in a database ;-)

I'm also interested in how the word frequencies change over time.  As an
example of why this is interesting, I took a look at them in Usenet postings
after the Columbine shootings a few years ago.  Spotting the features whose
frequencies change fastest gives an idea of where the discussion is going.
For example, "video games" showed up at high frequency initially, but
dropped rapidly, while "parents" rose.  Once you know the features that are
changing rapidly, you can go grab the sentences that contain the greatest
number of them, which yields a decent summary of the ideas that are moving,
so to speak, in a discussion.
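
The approach described above can be sketched in a few lines of Python.  This
is only an illustration under simple assumptions: single words as features
(no phrases such as "video games"), relative frequencies compared between two
time periods, and naive sentence splitting.  The function names
relative_freqs, fastest_changing, and summary_sentences are hypothetical, and
this is not necessarily how the original analysis was done.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")


def relative_freqs(texts):
    """Relative frequency of each word across a list of message bodies."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def fastest_changing(earlier, later, top_n=5):
    """Words whose relative frequency moved most between two periods.

    Returns (movers, delta); delta[w] > 0 means the word is rising.
    """
    f0, f1 = relative_freqs(earlier), relative_freqs(later)
    delta = {w: f1.get(w, 0.0) - f0.get(w, 0.0) for w in set(f0) | set(f1)}
    movers = sorted(delta, key=lambda w: (-abs(delta[w]), w))[:top_n]
    return movers, delta


def summary_sentences(texts, movers, top_n=2):
    """Sentences containing the greatest number of distinct moving words."""
    movers = set(movers)
    scored = []
    for text in texts:
        # Naive split: whitespace that follows ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            hits = len(movers & set(WORD_RE.findall(sentence.lower())))
            if hits:
                scored.append((hits, sentence.strip()))
    scored.sort(key=lambda pair: -pair[0])  # stable, so ties keep text order
    return [sentence for _, sentence in scored[:top_n]]
```

Given two lists of message bodies from consecutive periods, fastest_changing
picks the features, and summary_sentences then pulls the sentences that
mention the most of them.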

The really, really (unreachably!) big picture here is that, as Tim O'Reilly
and I like to say, I'm trying to figure out what the Internet is thinking
today.  I did a lot of my initial brainstorming on this stuff with the
O'Reilly folks, who use this sort of analysis to get a handle on which open
source software is gaining momentum and thus might deserve a book.

Nick




