Word frequencies -- Python or Perl for performance?
Aahz
aahz at pythoncraft.com
Tue Mar 19 23:10:09 EST 2002
In article <mailman.1016223990.19235.python-list at python.org>,
Nick Arnett <narnett at mccmedia.com> wrote:
>
>Anybody have any experience generating word frequencies from short
>documents with Python and Perl? Given a choice between the two, I'm
>wondering what will be faster. And a related question... any idea
>if there will be a significant performance hit (or advantage?) from
>storing the data in MySQL v. my own file-based data structures?
>
>I'll be processing a fairly large number of short (1-6K or so)
>documents at a time, so I'll be able to batch up things quite a bit.
>I'm thinking that the database might help me avoid loading up a lot of
>useless data. Since word frequencies follow a Zipf distribution, I'm
>guessing that I can spot unusual words (my goal here) by loading up
>the top 80 percent or so of words in the database (by occurrences) and
>focusing on the words that are in the docs but not in the set retrieved
>from the database.
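
[The approach described above — count words, take the head of the Zipf curve that covers ~80% of occurrences, and flag document words outside it — can be sketched in a few lines of Python. This is a minimal illustration, not tuned code; the tokenizer regex and the 0.8 coverage cutoff are assumptions you'd adjust.]

```python
import re
from collections import Counter

def word_counts(text):
    """Count occurrences of each lowercased word in a document."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def common_words(corpus_counts, coverage=0.8):
    """Return the set of top-frequency words that together account for
    `coverage` of all word occurrences (the head of the Zipf curve)."""
    total = sum(corpus_counts.values())
    head, running = set(), 0
    for word, n in corpus_counts.most_common():
        if running >= coverage * total:
            break
        head.add(word)
        running += n
    return head

def unusual_words(doc_text, common):
    """Words appearing in the document but not in the common set."""
    return set(word_counts(doc_text)) - common
```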
Well, well, well, long time no see. Why not just use Verity? ;-)
Seriously, for this kind of work Perl can probably be coded to run a bit
faster than Python, but if you expect to iterate a lot on your
algorithms, programmer time will count for a lot, and Python will likely
win there.
Given that it sounds like you want to create your own inverted word
index and do some sorting/searching based on word counts, it'll be hard
to get more bang for the buck than a real database. Unless you're on a
shoestring, consider getting a commercial database; you should probably
also check to see whether MySQL or PostgreSQL will give you better
performance.
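
[For reference, the in-memory version of an inverted word index is only a few lines of Python — a sketch of the data structure, assuming documents arrive as a dict of {doc_id: text}; a real database buys you persistence and scale on top of this.]

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of doc ids that contain it.
    `docs` is a dict of {doc_id: text}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # set() deduplicates so each doc is posted at most once per word
        for word in set(re.findall(r"[a-z']+", text.lower())):
            index[word].add(doc_id)
    return index
```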
--
Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/
The best way to get information on Usenet is not to ask a question, but
to post the wrong information. --Aahz