Word frequencies -- Python or Perl for performance?

Bengt Richter bokr at oz.net
Fri Mar 15 17:31:23 EST 2002


On Fri, 15 Mar 2002 12:26:30 -0800, "Nick Arnett" <narnett at mccmedia.com> wrote:

>Anybody have any experience generating word frequencies from short documents
>with Python and Perl?  Given a choice between the two, I'm wondering what
>will be faster.  And a related question... any idea if there will be a
It'll take 15 minutes to get a program working in Python ;-)
Ok, maybe a bit longer if your docs have tricky syntax to write
a splitting regex for ;-)

>significant performance hit (or advantage?) from storing the data in MySQL
>v. my own file-based data structures?
>
If all you want is to be able to get it back within another (or the same, later)
Python program, check into the pickle module. You could just dump the dictionary
of dictionaries I describe below, I believe.
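
Something along these lines ought to do for saving and reloading, as a rough
sketch only. The names save_freqs and load_freqs, the sample data, and the
file path you pass in are all made up for illustration:

```python
import pickle

# Hypothetical sample data: file name -> {word: count}
freqs = {
    "doc1.txt": {"python": 3, "perl": 1},
    "doc2.txt": {"zipf": 2, "python": 1},
}

def save_freqs(freqs, path):
    """Write the whole nested dictionary to disk in one go."""
    with open(path, "wb") as f:
        pickle.dump(freqs, f)

def load_freqs(path):
    """Read the nested dictionary back, in this or a later program."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Only unpickle files you wrote yourself, since pickle will run whatever the
file tells it to on load.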

>I'll be processing a fairly large number of short (1-6K or so) documents at
>a time, so I'll be able to batch up things quite a bit.  I'm thinking that
>the database might help me avoid loading up a lot of useless data.  Since
>word frequencies follow a Zipf distribution, I'm guessing that I can spot
>unusual words (my goal here) by loading up the top 80 percent or so of words
>in the database (by occurrences) and focusing on the words that are in the
>docs but not in the set retrieved from the database.
>
>Thanks for any thoughts on this and pointers to helpful examples or modules.
>
I'd just code up a short Python program using a Python dictionary with words
as keys and counts as values for each file, and put all these word frequency
dictionaries in a file dictionary keyed by file name.
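
A minimal sketch of that layout, assuming the words have already been
isolated somehow. The function names count_words and build_index are just
placeholders I picked:

```python
def count_words(words):
    """Return a dictionary mapping each word to its number of occurrences."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def build_index(docs):
    """docs maps file name -> list of words; returns file name -> word counts."""
    index = {}
    for name, words in docs.items():
        index[name] = count_words(words)
    return index

# Hypothetical input, standing in for words already split out of two files
docs = {
    "a.txt": ["the", "cat", "the", "hat"],
    "b.txt": ["zipf", "the"],
}
index = build_index(docs)
```

Here index["a.txt"]["the"] would be 2, and a word missing from a file simply
has no key in that file's dictionary.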

I'd bet everything fits in memory, unless you have an incredible corpus of
words in your files, or small memory.

Isolating words will depend on document syntax etc. You may be able to do it
by splitting with a compiled regular expression for a combination of white
space and punctuation.
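
For plain text, a first stab at the splitting might look like this. It
assumes "punctuation" means the ASCII set in string.punctuation and that
folding everything to lower case is acceptable; both are guesses about your
documents, and split_words is a made-up name:

```python
import re
import string

# Compile once: one or more whitespace or ASCII punctuation characters.
# Assumption: string.punctuation is a good enough definition of punctuation.
SPLITTER = re.compile("[\\s" + re.escape(string.punctuation) + "]+")

def split_words(text):
    """Split text on whitespace/punctuation, dropping empty pieces."""
    return [w.lower() for w in SPLITTER.split(text) if w]
```

Note that this treats apostrophes and hyphens as separators too, so "don't"
comes out as "don" and "t". Adjust the character class if that matters.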


Regards,
Bengt Richter



