Word frequencies -- Python or Perl for performance?

Tim Churches tchur at optushome.com.au
Fri Mar 15 17:50:32 EST 2002


Bengt Richter wrote:
> 
> On Fri, 15 Mar 2002 12:26:30 -0800, "Nick Arnett" <narnett at mccmedia.com> wrote:
> 
> >Anybody have any experience generating word frequencies from short documents
> >with Python and Perl?  Given a choice between the two, I'm wondering what
> >will be faster.
> It'll take 15 minutes to get a program working in Python ;-)
> Ok, maybe a bit longer if your docs have tricky syntax to write
> a splitting regex for ;-)
> 
> >And a related question... any idea if there will be a
> >significant performance hit (or advantage?) from storing the data in MySQL
> >v. my own file-based data structures?
> >
> If all you want is to be able to get it back within another (or the same, later)
> Python program, check into the pickle module. You could just dump the dictionary
> of dictionaries I describe below, I believe.
> 
> >I'll be processing a fairly large number of short (1-6K or so) documents at
> >a time, so I'll be able to batch up things quite a bit.  I'm thinking that
> >the database might help me avoid loading up a lot of useless data.  Since
> >word frequencies follow a Zipf distribution, I'm guessing that I can spot
> >unusual words (my goal here) by loading up the top 80 percent or so of words
> >in the database (by occurrences) and focusing on the words that are in the
> >docs but not in the set retrieved from the database.
> >
> >Thanks for any thoughts on this and pointers to helpful examples or modules.
> >
> I'd just code up a short Python program using a Python dictionary with words
> as keys and counts as values for each file, and put all these word-frequency
> dictionaries in an outer dictionary keyed by file name.
> 
> I'd bet everything fits in memory, unless you have an incredible corpus of
> words in your files, or small memory.
> 
> Isolating words will depend on document syntax etc. You may be able to do it
> by splitting with a compiled regular expression for a combination of white
> space and punctuation.
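A minimal sketch of the approach Bengt describes above -- one counts
dictionary per document, gathered into an outer dictionary and pickled for
a later run. The regex and the sample document names/texts here are
assumptions; adjust the splitting pattern to your documents' actual syntax.

```python
import pickle
import re

# Split on runs of characters that are neither word characters nor
# apostrophes; tune this pattern for your documents' syntax.
word_re = re.compile(r"[^\w']+")

def word_counts(text):
    """Map each word in `text` (lowercased) to its occurrence count."""
    counts = {}
    for word in word_re.split(text.lower()):
        if word:
            counts[word] = counts.get(word, 0) + 1
    return counts

# Hypothetical documents standing in for your 1-6K files.
docs = {
    "doc1.txt": "The cat sat on the mat.",
    "doc2.txt": "Word frequencies follow a Zipf distribution.",
}

# One frequency dictionary per file, keyed by file name.
freqs = {name: word_counts(text) for name, text in docs.items()}

# Persist the dictionary of dictionaries for a later Python run.
with open("freqs.pkl", "wb") as f:
    pickle.dump(freqs, f)
```

Loading it back later is just `pickle.load(open("freqs.pkl", "rb"))`.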

I concur with Bengt's suggested approach. You might also want to use
something like the Porter stemming algorithm to convert words to their
"base" forms, e.g. stepped -> step.

See http://www.tartarus.org/~martin/PorterStemmer/python.txt for a
Python implementation.
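To give a rough sense of the idea, here is a toy suffix-stripper -- this is
not the Porter algorithm (which applies a much larger, carefully ordered
rule set with measure conditions); the suffix list below is an illustrative
assumption only.

```python
def crude_stem(word):
    """Strip a few common English suffixes -- a toy stand-in for a
    real stemmer such as the Porter implementation linked above.
    ('ped' is a hack for the doubled consonant in 'stepped'.)"""
    for suffix in ("ingly", "ing", "ped", "ed", "ly", "s"):
        # Only strip if a plausible stem of 3+ letters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```

Stemming each word before counting lets "stepped", "stepping", and "step"
all accumulate under one dictionary key.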

Tim C


More information about the Python-list mailing list