Word frequencies -- Python or Perl for performance?

Bengt Richter bokr at oz.net
Wed Mar 20 03:48:07 EST 2002


On Fri, 15 Mar 2002 12:26:30 -0800, "Nick Arnett" <narnett at mccmedia.com> wrote:

>Anybody have any experience generating word frequencies from short documents
>with Python and Perl?  Given a choice between the two, I'm wondering what
>will be faster.  And a related question... any idea if there will be a
>significant performance hit (or advantage?) from storing the data in MySQL
>v. my own file-based data structures?
It depends on what you mean by "the data" and what use you intend for it.
What are you trying to do besides get word frequencies in various doc files?
>
>I'll be processing a fairly large number of short (1-6K or so) documents at
>a time, so I'll be able to batch up things quite a bit.  I'm thinking that
>the database might help me avoid loading up a lot of useless data.  Since
>word frequencies follow a Zipf distribution, I'm guessing that I can spot
>unusual words (my goal here) by loading up the top 80 percent or so of words
Do you want just the list of words, or do you want to index back to the
context(s) where they were used? Just getting a list of the least popular
20% of words overall is not much of a problem.
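
Just to make that part concrete, here's a minimal sketch of the global-frequency
pass in Python. The glob pattern, the regex tokenizer, and the 80% cutoff are
illustrative assumptions on my part, not anything from your post:

    import re, glob

    word_re = re.compile(r"[a-z']+")          # crude tokenizer, good enough for a sketch

    counts = {}
    for path in glob.glob("docs/*.txt"):      # hypothetical location of the short docs
        text = open(path).read().lower()
        for w in word_re.findall(text):
            counts[w] = counts.get(w, 0) + 1

    # Sort by descending count; the tail of this list is the rare ("unusual")
    # end of the Zipf curve.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    tail = ranked[int(len(ranked) * 0.8):]    # least-common 20% of distinct words
    for w, n in tail[:25]:
        print("%6d  %s" % (n, w))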

>in the database (by occurrences) and focusing on the words that are in the
>docs but not in the set retrieved from the database.
>
What are you going to do with which words?
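
If the plan is that filter per document, the per-document work is just a set
difference, which is cheap in Python. A rough sketch, assuming the common words
have already been pulled out of the database into a set (the names and the toy
data below are hypothetical):

    import re

    word_re = re.compile(r"[a-z']+")

    def unusual_words(text, common_words):
        """Return the document's words that are not in the common set."""
        doc_words = set(word_re.findall(text.lower()))
        return doc_words - common_words

    # Toy stand-in for the top-80%-by-occurrence query result.
    common_words = set(["the", "of", "and", "to", "a", "in"])
    print(sorted(unusual_words("The aardvark ambled to the xylophone.", common_words)))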

>Thanks for any thoughts on this and pointers to helpful examples or modules.
>
Others seem to have inferred what you're up to, but I'll have to stick with
my original questions until you say what you're really trying to do. It's not even
clear that you're interested in anything but global word frequencies over your
entire set of files, apart from the clue that you think you may need to store
something in a database ;-)

I'm a little curious ;-)

Regards,
Bengt Richter
