Word frequencies -- Python or Perl for performance?

Aahz aahz at pythoncraft.com
Tue Mar 19 23:10:09 EST 2002


In article <mailman.1016223990.19235.python-list at python.org>,
Nick Arnett <narnett at mccmedia.com> wrote:
>
>Anybody have any experience generating word frequencies from short
>documents with Python and Perl?  Given a choice between the two, I'm
>wondering what will be faster.  And a related question... any idea
>if there will be a significant performance hit (or advantage?) from
>storing the data in MySQL vs. my own file-based data structures?
>
>I'll be processing a fairly large number of short (1-6K or so)
>documents at a time, so I'll be able to batch up things quite a bit.
>I'm thinking that the database might help me avoid loading up a lot of
>useless data.  Since word frequencies follow a Zipf distribution, I'm
>guessing that I can spot unusual words (my goal here) by loading up
>the top 80 percent or so of words in the database (by occurrences) and
>focusing on the words that are in the docs but not in the set retrieved
>from the database.

Well, well, well, long time no see.  Why not just use Verity?  ;-)

Seriously, for this kind of work, it's quite likely that Perl can be
coded to run a bit faster than Python, but if you expect to iterate
heavily on your algorithms, programmer time will count for a lot, and
Python will probably win there.
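To give a feel for that side of the tradeoff, here's a rough sketch of
the counting-and-filtering approach described above.  The regex
tokenizer and the common_words set are placeholders I'm assuming, not
anything Nick specified:

    import re
    from collections import Counter

    WORD_RE = re.compile(r"[a-z']+")

    def word_frequencies(text):
        # Count occurrences of each word in one document.
        return Counter(WORD_RE.findall(text.lower()))

    def unusual_words(text, common_words):
        # Report words in the document that are missing from the
        # common-word set (the top-80%-by-occurrence set loaded
        # from the database in the approach described above).
        freqs = word_frequencies(text)
        return dict((word, count) for word, count in freqs.items()
                    if word not in common_words)

Everything past the tokenizer is a couple of lines, which is exactly
where the fast-iteration argument for Python comes from.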

Given that it sounds like you want to create your own inverted word
index and do some sorting/searching based on word counts, it'll be hard
to get more bang for the buck than a real database.  Unless you're on a
shoestring, consider getting a commercial database; you should probably
also check to see whether MySQL or PostgreSQL will give you better
performance.
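To make the database idea concrete, here's a minimal sketch of a
word-count table and the top-80% query, using sqlite3 from the
standard library purely as a stand-in for MySQL or PostgreSQL; the
schema and function names are assumptions:

    import sqlite3

    conn = sqlite3.connect("wordfreq.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS word_counts (
                        word TEXT PRIMARY KEY,
                        occurrences INTEGER NOT NULL DEFAULT 0)""")

    def add_counts(freqs):
        # Merge one document's word frequencies into the table.
        for word, count in freqs.items():
            conn.execute("INSERT OR IGNORE INTO word_counts "
                         "(word, occurrences) VALUES (?, 0)", (word,))
            conn.execute("UPDATE word_counts "
                         "SET occurrences = occurrences + ? "
                         "WHERE word = ?", (count, word))
        conn.commit()

    def top_words(fraction=0.8):
        # Return the set of words covering roughly the top
        # `fraction` of total occurrences, most frequent first.
        total = conn.execute(
            "SELECT SUM(occurrences) FROM word_counts").fetchone()[0] or 0
        cutoff = total * fraction
        words, running = set(), 0
        for word, count in conn.execute(
                "SELECT word, occurrences FROM word_counts "
                "ORDER BY occurrences DESC"):
            if running >= cutoff:
                break
            words.add(word)
            running += count
        return words

Once the counts live in a table, pulling the common-word set is a
single ORDER BY query, and the database handles the sorting and
persistence that file-based structures would force you to write
yourself.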
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

The best way to get information on Usenet is not to ask a question, but
to post the wrong information.  --Aahz


