Word frequencies -- Python or Perl for performance?

Nick Arnett narnett at mccmedia.com
Thu Mar 21 19:21:58 EST 2002


> -----Original Message-----
> From: python-list-admin at python.org
> [mailto:python-list-admin at python.org]On Behalf Of Jim Dennis

...

>  I deliberately made it a class so you could instantiate
>  multiple word counts on different blocks of text to compare them
>  or whatever, and so you could import it into other programs and
>  use other methods (e.g. a web spider with urllib and a "text" extractor
>  with htmllib) to get your text.

Yep -- that's where I'm headed.

>  With a bit of work this could be cleaned up into a more general
>  class which could then be used as the parent of some more specialized
>  word counters (with different notions of acceptable character sets,
>  and different semantics on handling hyphens and apostrophes).  In fact
>  it would make a lot of sense to simplify my Wordcount class and either
>  use it as a base class or put a Decorator class in front of it to
>  impose all the text parsing semantics prior to calling Wordcount.add()

When I get it into good, flexible shape, I'll make it publicly available.
Even though I don't think I want stemming, it'll be easy to optionally call
a stemming module.
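
To make the base-class idea concrete, here is a minimal sketch in modern Python. It is not Jim's actual Wordcount class, which is not reproduced in this message: the names WordCount, SplitHyphenWordCount, word_pattern and the stemmer argument are all assumptions made for illustration. Only the add() method name comes from the discussion above.

```python
import re
from collections import Counter


class WordCount:
    """Hypothetical base class: counts the words in text passed to add()."""

    # Assumed default notion of a "word": letters, optionally joined by
    # internal apostrophes or hyphens (so "isn't" and "well-known" are words).
    word_pattern = re.compile(r"[a-z]+(?:['-][a-z]+)*")

    def __init__(self, stemmer=None):
        # stemmer is an optional callable mapping str -> str. It stands in
        # for "a stemming module" and is not a real library API.
        self.stemmer = stemmer
        self.counts = Counter()

    def tokenize(self, text):
        """Split text into lowercase words; subclasses may override this."""
        return self.word_pattern.findall(text.lower())

    def add(self, text):
        """Count every word found in one block of text."""
        for word in self.tokenize(text):
            if self.stemmer is not None:
                word = self.stemmer(word)
            self.counts[word] += 1


class SplitHyphenWordCount(WordCount):
    """Specialized counter that treats hyphens as word separators."""

    word_pattern = re.compile(r"[a-z]+(?:'[a-z]+)*")


plain = WordCount()
plain.add("The well-known parser isn't well known.")

split = SplitHyphenWordCount()
split.add("The well-known parser isn't well known.")
```

With these definitions, plain counts "well-known" as a single word, while split counts "well" and "known" twice each. Passing a callable as stemmer normalizes each word before it is counted, so stemming stays optional.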

>  I'd definitely factor the "known words" list out of this; possibly
>  as its own class which could be (optionally) used by Wordcount (so
>  you'd decide at instantiation if a "known words" dictionary would be
>  used and (if so) which "known word" sources to use).  It would then
>  be possible to use most of that to add support for "stop words"
>  ("words" that would NOT be counted).  In practical usage it might be
>  sensible to pickle or shelve the "known words" dictionary since it's
>  moderately expensive to create it at each run-time.

Definitely.  For what I'm doing, almost everything is a stop word.  I'm
looking for outliers, trying to distinguish between those that are just
freakish (random misspellings, etc.) and those that are actually emerging as
significant.  It's very interesting what can automatically emerge if you can
spot outliers that are converging inward over time, so to speak.
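
One possible reading of that idea, sketched in Python: drop every known word, count what is left per time period, and flag the leftover words whose counts keep rising. The function names, the min_periods parameter, and the "never drops and ends higher" rule are assumptions made for illustration, not the method actually used here.

```python
from collections import Counter


def rare_word_counts(text, known_words):
    """Count only the words that are absent from the known-words set."""
    return Counter(w for w in text.lower().split() if w not in known_words)


def emerging_words(period_counts, min_periods=2):
    """Return words whose count never drops and ends higher than it began.

    period_counts is a list of Counters in time order. A word must appear
    in at least min_periods periods, which filters one-off misspellings.
    """
    emerging = []
    for word in sorted(set().union(*period_counts)):
        series = [counts[word] for counts in period_counts]
        seen = sum(1 for n in series if n > 0)
        rising = all(a <= b for a, b in zip(series, series[1:]))
        if seen >= min_periods and rising and series[-1] > series[0]:
            emerging.append(word)
    return emerging


known = {"the", "a", "is", "new", "on", "of"}
periods = [
    rare_word_counts("the blog is new teh", known),
    rare_word_counts("a blog on the blog", known),
    rare_word_counts("blog blog blog of the blog", known),
]
result = emerging_words(periods)
```

In this toy data "teh" shows up once and vanishes, so it is treated as a freak, while "blog" grows from period to period and is the only word returned.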

>  Those bits of refactoring would be pretty simple.  Using this in
>  a program that posted the results to your RDBMS should also be easy
>  enough.

Most certainly.
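
As a rough sketch of that last step, the following uses Python's standard sqlite3 module as a stand-in for whatever RDBMS is actually in use. The table name word_counts, its columns, and the store_counts helper are assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def store_counts(conn, source, counts):
    """Write one row per (source, word) pair, replacing any earlier count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts ("
        "source TEXT, word TEXT, n INTEGER, PRIMARY KEY (source, word))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO word_counts (source, word, n) VALUES (?, ?, ?)",
        [(source, word, n) for word, n in counts.items()],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
store_counts(conn, "msg-1", Counter("spam eggs spam".split()))
rows = conn.execute(
    "SELECT word, n FROM word_counts WHERE source = ? ORDER BY word",
    ("msg-1",),
).fetchall()
```

Keying the table on (source, word) lets the same text be re-counted without duplicating rows. "INSERT OR REPLACE" is SQLite syntax, so another database would need its own upsert statement.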

Thanks again.

Nick




