Word frequencies -- Python or Perl for performance?
Jim Dennis
jimd at vega.starshine.org
Thu Mar 21 17:30:44 EST 2002
In article <mailman.1016724875.29984.python-list at python.org>, Nick Arnett wrote:
>> -----Original Message-----
>> From: python-list-admin at python.org
>> [mailto:python-list-admin at python.org]On Behalf Of Jim Dennis
>[snip]
>> I don't know what you're really trying to do, but I decided to
>> code up a quickie "word counter" for the hell of it.
>Wow -- thanks. I'm going to ask questions more often now!
>Nick
I deliberately made it a class so you could instantiate
multiple word counts on different blocks of text to compare them
or whatever, and so you could import it into other programs and
use other methods (e.g. a web spider with urllib and a "text" extractor
with htmllib) to get your text.
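
For anyone who missed the earlier post, here is a rough sketch of the
"one instance per block of text" idea. This is a stand-in, not the code
I posted: only the names Wordcount and add() come from this thread, and
the "counts" attribute and the letters-only notion of a word are
assumptions made for the example.

```python
import re
from collections import Counter


class Wordcount:
    """Stand-in for the class discussed above; not the original code."""

    def __init__(self):
        # Hypothetical attribute name: word -> number of occurrences.
        self.counts = Counter()

    def add(self, text):
        # Assumption: a "word" is a run of ASCII letters, lowercased.
        self.counts.update(re.findall(r"[a-z]+", text.lower()))


# Two independent counters, one per block of text.
first = Wordcount()
second = Wordcount()
first.add("the cat sat on the mat")
second.add("the dog sat")

# Compare them: which words turn up in both blocks?
shared = sorted(set(first.counts) & set(second.counts))
print(shared)
```

Because add() only wants a string, it doesn't care whether the text came
from a file, a spider, or anything else.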
With a bit of work this could be cleaned up into a more general
class which could then be used as the parent of some more specialized
word counters (with different notions of acceptable character sets,
and different semantics on handling hyphens and apostrophes). In fact
it would make a lot of sense to simplify my Wordcount class and either
use it as a base class or put a Decorator class in front of it to
impose all the text parsing semantics prior to calling Wordcount.add().
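
A minimal sketch of the "Decorator in front of Wordcount" variant,
assuming a stripped-down Wordcount whose add() takes words that are
already parsed. TextParser, HyphenApostropheParser, and the regular
expressions are hypothetical names and rules chosen for illustration;
they are not part of the code I posted.

```python
import re
from collections import Counter


class Wordcount:
    """Stand-in counter: add() takes words that are already parsed."""

    def __init__(self):
        self.counts = Counter()

    def add(self, words):
        self.counts.update(words)


class TextParser:
    """Wrapper that imposes parsing rules, then calls Wordcount.add()."""

    # Default rule: a word is a run of ASCII letters.
    pattern = re.compile(r"[a-z]+")

    def __init__(self, counter):
        self.counter = counter

    def add(self, text):
        self.counter.add(self.pattern.findall(text.lower()))


class HyphenApostropheParser(TextParser):
    # Different semantics: keep internal hyphens and apostrophes.
    pattern = re.compile(r"[a-z]+(?:['-][a-z]+)*")


plain = TextParser(Wordcount())
fancy = HyphenApostropheParser(Wordcount())
sample = "Don't re-read the well-known text"
plain.add(sample)
fancy.add(sample)

print(sorted(plain.counter.counts))
print(sorted(fancy.counter.counts))
```

The counter itself never changes; each new notion of what a "word" is
only costs a subclass that overrides the pattern.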
I'd definitely factor the "known words" list out of this, possibly
as its own class which could be (optionally) used by Wordcount, so
you'd decide at instantiation whether a "known words" dictionary would
be used and, if so, which "known word" sources to use. It would then
be possible to reuse most of that to add support for "stop words"
("words" that would NOT be counted). In practical usage it might be
sensible to pickle or shelve the "known words" dictionary, since it's
moderately expensive to create it at each run.
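
Roughly what that factoring might look like. Everything here is a
sketch under assumptions: WordSet, its save()/load() methods, and the
"known"/"stop" constructor arguments are hypothetical names, the word
sources are plain in-memory lists rather than real dictionary files,
and tallying unknown words separately is just one possible policy.

```python
import os
import pickle
import tempfile
from collections import Counter


class WordSet:
    """Hypothetical helper: a set of words built from one or more sources."""

    def __init__(self, sources=()):
        self.words = set()
        for source in sources:
            # Each source is any iterable of words (a list, an open file...).
            self.words.update(w.strip().lower() for w in source if w.strip())

    def __contains__(self, word):
        return word.lower() in self.words

    def save(self, path):
        # Cache the set so later runs can skip rebuilding it.
        with open(path, "wb") as fh:
            pickle.dump(self.words, fh)

    @classmethod
    def load(cls, path):
        obj = cls()
        with open(path, "rb") as fh:
            obj.words = pickle.load(fh)
        return obj


class Wordcount:
    def __init__(self, known=None, stop=None):
        # Both word sets are optional and chosen at instantiation.
        self.known = known
        self.stop = stop
        self.counts = Counter()
        self.unknown = Counter()

    def add(self, words):
        for word in words:
            word = word.lower()
            if self.stop is not None and word in self.stop:
                continue  # stop words are NOT counted
            if self.known is not None and word not in self.known:
                self.unknown[word] += 1  # assumption: tally these separately
            else:
                self.counts[word] += 1


known = WordSet([["the", "cat", "sat", "on", "mat"]])
stop = WordSet([["the", "on"]])

# Round-trip the known-words set through a pickle file.
cache = os.path.join(tempfile.mkdtemp(), "known.pickle")
known.save(cache)
known = WordSet.load(cache)

wc = Wordcount(known=known, stop=stop)
wc.add("the cat sat on the zorp".split())
print(dict(wc.counts), dict(wc.unknown))
```

The same WordSet class serves for both jobs, which is the point of
pulling it out of Wordcount in the first place.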
Naturally I'd also change the "dump()" method, possibly just returning
the dictionary and delegating the sorting, filtering, and analysis
of the results to some other class or function.
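
Something along these lines, as a sketch: dump() just hands back the
dictionary, and a separate function does the sorting and filtering.
The most_frequent() helper and its "limit" and "min_count" parameters
are hypothetical, as is the stripped-down Wordcount used to drive it.

```python
from collections import Counter


class Wordcount:
    """Stand-in counter; only the dump() behaviour matters here."""

    def __init__(self):
        self.counts = Counter()

    def add(self, words):
        self.counts.update(words)

    def dump(self):
        # Return a plain dict copy so callers cannot disturb the tally.
        return dict(self.counts)


def most_frequent(counts, limit=10, min_count=1):
    """Sort by descending count, then alphabetically; keep the top rows."""
    rows = [(word, n) for word, n in counts.items() if n >= min_count]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows[:limit]


wc = Wordcount()
wc.add("the cat sat on the mat the end".split())
top = most_frequent(wc.dump(), limit=2)
print(top)
```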
Those bits of refactoring would be pretty simple. Using this in
a program that posted the results to your RDBMS should also be easy
enough.
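
For the database side, a sketch using an in-memory SQLite database as a
stand-in for whatever RDBMS you are actually running. The table name,
column names, and the post_counts() function are all made up for the
example, and "INSERT OR REPLACE" is SQLite syntax that other databases
spell differently.

```python
import sqlite3


def post_counts(conn, source, counts):
    """Store one row per (source, word) pair; hypothetical schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts ("
        " source TEXT, word TEXT, n INTEGER,"
        " PRIMARY KEY (source, word))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO word_counts VALUES (?, ?, ?)",
        [(source, word, n) for word, n in counts.items()],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
post_counts(conn, "sample.txt", {"the": 3, "cat": 1})

rows = conn.execute(
    "SELECT word, n FROM word_counts ORDER BY n DESC, word"
).fetchall()
print(rows)
```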