Word frequencies -- Python or Perl for performance?

Jim Dennis jimd at vega.starshine.org
Thu Mar 21 17:30:44 EST 2002


In article <mailman.1016724875.29984.python-list at python.org>, Nick Arnett wrote:

>> -----Original Message-----
>> From: python-list-admin at python.org
>> [mailto:python-list-admin at python.org]On Behalf Of Jim Dennis

>[snip]

>>  I don't know what you're really trying to do, but I decided to 
>>  code up a quickie "word counter" for the hell of it.

>Wow -- thanks.  I'm going to ask questions more often now!

>Nick

 I deliberately made it a class so you could instantiate 
 multiple word counts on different blocks of text to compare them
 or whatever, and so you could import it into other programs and
 use other methods (e.g. a web spider with urllib and a "text" extractor
 with htmllib) to get your text.

 With a bit of work this could be cleaned up into a more general
 class which could then be used as the parent of some more specialized
 word counters (with different notions of acceptable character sets,
 and different semantics for handling hyphens and apostrophes).  In fact
 it would make a lot of sense to simplify my Wordcount class and either
 use it as a base class or put a Decorator class in front of it to
 impose all the text parsing semantics prior to calling Wordcount.add().
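
 A minimal sketch of the Decorator idea, in present-day Python.  The
 original Wordcount class is not reproduced in this message, so the
 Wordcount below is a hypothetical stand-in that only counts; TextParser,
 SplitHyphensParser, feed() and both regular expressions are likewise
 assumptions made for illustration.

```python
import re
from collections import Counter


class Wordcount:
    """Hypothetical stand-in for the counter: it only counts words."""

    def __init__(self):
        self.counts = Counter()

    def add(self, word):
        self.counts[word] += 1


class TextParser:
    """Sits in front of a Wordcount and imposes the parsing rules."""

    # Assumed default: letters, with internal apostrophes or hyphens kept.
    pattern = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

    def __init__(self, counter):
        self.counter = counter

    def feed(self, text):
        # Lowercase first so "Word" and "word" are counted together.
        for word in self.pattern.findall(text.lower()):
            self.counter.add(word)


class SplitHyphensParser(TextParser):
    """Variant that treats the parts of a hyphenated compound as words."""

    pattern = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")
```

 Swapping TextParser for SplitHyphensParser changes how hyphens are
 handled without touching the counter itself.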

 I'd definitely factor the "known words" list out of this, possibly
 as its own class which could (optionally) be used by Wordcount, so
 you'd decide at instantiation whether a "known words" dictionary would
 be used and, if so, which "known word" sources to use.  It would then
 be possible to reuse most of that to add support for "stop words"
 ("words" that would NOT be counted).  In practical usage it might be
 sensible to pickle or shelve the "known words" dictionary, since it's
 moderately expensive to create it on each run.
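
 One way that factoring might look, as a sketch only.  WordList, its
 load()/save()/restore() methods, and the known= and stop= arguments are
 all hypothetical names; the Wordcount here is a simplified stand-in,
 not the original class.

```python
import pickle
from collections import Counter


class WordList:
    """A set of words gathered from one or more sources (hypothetical)."""

    def __init__(self, sources=()):
        self.words = set()
        for source in sources:
            self.load(source)

    def load(self, lines):
        # `lines` is any iterable of strings, e.g. an open file with
        # one word per line.
        self.words.update(s.strip().lower() for s in lines if s.strip())

    def __contains__(self, word):
        return word.lower() in self.words

    def save(self, path):
        # Pickle the set so later runs can skip rebuilding it.
        with open(path, "wb") as fh:
            pickle.dump(self.words, fh)

    @classmethod
    def restore(cls, path):
        obj = cls()
        with open(path, "rb") as fh:
            obj.words = pickle.load(fh)
        return obj


class Wordcount:
    """Simplified stand-in; both word lists are chosen at instantiation."""

    def __init__(self, known=None, stop=None):
        self.known = known  # None means "accept every word"
        self.stop = stop    # None means "no stop words"
        self.counts = Counter()

    def add(self, word):
        word = word.lower()
        if self.stop is not None and word in self.stop:
            return
        if self.known is not None and word not in self.known:
            return
        self.counts[word] += 1
```

 The same WordList class serves for both "known words" and "stop words";
 only the way Wordcount.add() consults it differs.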

 Naturally I'd also change the "dump()" method, possibly just returning
 the dictionary and delegating the sorting, filtering, and analysis
 of the results to some other class or function.
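
 A sketch of that split, assuming dump() simply hands back a plain
 dictionary.  The top_words() helper and its n and min_count parameters
 are hypothetical, and the Wordcount is again a minimal stand-in.

```python
from collections import Counter


class Wordcount:
    """Minimal stand-in: dump() returns the data and does no reporting."""

    def __init__(self):
        self.counts = Counter()

    def add(self, word):
        self.counts[word] += 1

    def dump(self):
        # Return a copy so callers cannot disturb the running counts.
        return dict(self.counts)


def top_words(counts, n=10, min_count=1):
    """Filter and sort a {word: count} dict outside the counter class."""
    kept = [(w, c) for w, c in counts.items() if c >= min_count]
    # Most frequent first; ties are broken alphabetically.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept[:n]
```

 Any other analysis (ratios between two counts, say) can then be written
 as another function over the same dictionary.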

 Those bits of refactoring would be pretty simple.  Using this in
 a program that posted the results to your RDBMS should also be easy 
 enough.  
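
 As a rough illustration of posting the results to a database, here is
 a sketch using the standard sqlite3 module as a stand-in for whatever
 RDBMS is actually in use.  The store_counts() function, the word_counts
 table and its column names are assumptions, not anything from the
 original code.

```python
import sqlite3


def store_counts(conn, label, counts):
    """Write a {word: count} dict to a (hypothetical) word_counts table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts ("
        " label TEXT NOT NULL,"
        " word TEXT NOT NULL,"
        " occurrences INTEGER NOT NULL,"
        " PRIMARY KEY (label, word))"
    )
    # Parameterized inserts; re-posting a label overwrites its old rows.
    conn.executemany(
        "INSERT OR REPLACE INTO word_counts (label, word, occurrences)"
        " VALUES (?, ?, ?)",
        [(label, word, count) for word, count in counts.items()],
    )
    conn.commit()
```

 The label column lets several blocks of text share one table so they
 can be compared with ordinary SQL.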



