Word frequencies -- Python or Perl for performance?
Jim Dennis
jimd at vega.starshine.org
Thu Mar 21 07:03:22 EST 2002
In article <mailman.1016223990.19235.python-list at python.org>, Nick Arnett wrote:
> Anybody have any experience generating word frequencies from short documents
> with Python and Perl? Given a choice between the two, I'm wondering what
> will be faster. And a related question... any idea if there will be a
> significant performance hit (or advantage?) from storing the data in MySQL
> v. my own file-based data structures?
> I'll be processing a fairly large number of short (1-6K or so) documents at
> a time, so I'll be able to batch up things quite a bit. I'm thinking that
> the database might help me avoid loading up a lot of useless data. Since
> word frequencies follow a Zipf distribution, I'm guessing that I can spot
> unusual words (my goal here) by loading up the top 80 percent or so of words
> in the database (by occurrences) and focusing on the words that are in the
> docs but not in the set retrieved from the database.
> Thanks for any thoughts on this and pointers to helpful examples or modules.
> Nick Arnett
I don't know what you're really trying to do, but I decided to
code up a quickie "word counter" for the hell of it.
I started with one that would simply count "words" (whitespace-separated
sequences of letters, hyphens and apostrophes). I then
decided to also note which of them were "known" words and keep
a count of those as well.
So, here's my very own version (in about 80 lines):
#!/usr/bin/env python2.2
import sys, string

class Wordcount:
    """Keep a count of all unique "words" in a text.

    Maintain a dictionary of words, each with a count of the
    number of occurrences; add arbitrary text to it, and dump
    the dictionary on demand."""
    # for words like o'clock and O'Holloran and fiddle-faddle
    # what should we do about contractions?
    # ditto possessive forms?
    tr = string.maketrans('', '')
    rm = string.punctuation + string.digits
    rm = string.translate(rm, tr, "'-")
    # Don't remove apostrophes and hyphens from "words"
    knownWords = {}
    knownWordsRead = 0

    def __init__(self):
        # Each instance gets its own word list and total count
        self.words = {}
        self.count = 0   # total words processed
        self.nword = 0   # number of words in our instance dictionary
        self.known = 0   # number of our words found in the class dict.
        if not Wordcount.knownWordsRead:
            # We'll try to create the "known words" dictionary,
            # but we only do that on first instantiation
            # since all instances share this one dictionary
            try:
                Wordcount.knownWordsRead = 1
                wlist = open('/usr/share/dict/words', 'r')
                for i in wlist:
                    i = i.lower().strip()
                    if not i in Wordcount.knownWords:
                        Wordcount.knownWords[i] = 0
            except IOError:
                pass    # but we won't try very hard

    def add(self, text):
        for each in text.split():
            word = string.translate(each, Wordcount.tr, Wordcount.rm).lower()
            word = word.strip()
            # print "debug: ", word
            while word.endswith("'"):   word = word[:-1]  # strip quotes
            while word.startswith("'"): word = word[1:]   # ... leading ones too
            if word.startswith('-'): continue
            if word.endswith('-'):   continue
            if word.endswith("n't"): word = word[:-3]     # can't include these
            if word.endswith("'ll"): word = word[:-3]     # or you'll wonder
            if word.endswith("'s"):  word = word[:-2]     # who's
            if word == '-' or word == "'" or len(word) < 1: continue
            self.count += 1
            if not word in self.words:
                self.words[word] = 1
                self.nword += 1
            else:
                self.words[word] += 1
            if word in Wordcount.knownWords: self.known += 1

    def dump(self):
        # return (count, word) pairs, most frequent first
        items = [(y, x) for x, y in self.words.items()]
        items.sort()
        items.reverse()
        return items

if __name__ == '__main__':
    wcount = Wordcount()
    for i in sys.argv[1:]:
        file = open(i, "r")
        for line in file:
            wcount.add(line)
        # handle hyphenation?
        ## poss. by cutting the last word IFF it ends in a hyphen
        ## and prepending it to the next line.
    print wcount.count, wcount.known, wcount.nword, \
        wcount.known/float(wcount.count), wcount.nword/float(wcount.known)
    for count, word in wcount.dump():
        if count > 1:                # Skip "unique" words
            if word in Wordcount.knownWords: word += "*"
            print "%7d %s" % (count, word)
    print wcount.count, wcount.known, wcount.nword, \
        wcount.known/float(wcount.count), wcount.nword/float(wcount.known)
It's a bit crude, particularly in handling apostrophes (vs.
single quotes) and hyphens/dashes. However, it seems to work
okay. I made NO effort to optimize it. (If I were concerned
about speed I'd at least read each file with file.readlines()
rather than iterating over it line by line.)
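For instance, the main loop could gulp each file in larger chunks; a
sketch of what I mean (untested -- and since add() just calls
text.split(), whole files work as well as single lines):

    for i in sys.argv[1:]:
        f = open(i, "r")
        for line in f.readlines():   # one big read instead of line-at-a-time
            wcount.add(line)
        f.close()
        # or, more aggressively: wcount.add(open(i, "r").read())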
I tested by running the following commands:
for i in /bin/* /usr/bin/*; do
    bname=$(basename $i); man $bname | col -b > /tmp/$bname.man
done
time ./wordcount.py /tmp/*.man | head
Here's the output from that (the summary line is: total words, "known"
occurrences, distinct words, known/total, and distinct/known):
1602048 1361723 36978 0.849988889222 0.0271553025101
117960 the*
41673 to*
36275 is*
34975 a
32191 of*
27045 and*
22881 in*
20336 for*
17571 be*
Traceback (most recent call last):
File "./wordcount.py", line 81, in ?
print "%7d %s" % (count, word)
IOError: [Errno 32] Broken pipe
real 1m48.212s
user 1m47.950s
sys 0m0.250s
$ ls /tmp/*.man | ./wc.py
1761 1761 31804
$ du /tmp/*.man
....
15836 total
$ find /tmp/*.man -printf "%s\n" \
| awk '{n++; t+=$1}; END { print t/n ; }'
7104.33
... so it handled over 1700 medium-sized files (averaging about
7K each, about 14MB total) in less than two minutes. Of the words I
counted, about 85% were "known" words from
/usr/share/dict/words, and the number of distinct words came to
under 3% of the known-word count. (In other words, the Linux man
pages use only a few percent of the English vocabulary.) I doubt
the top ten words from my list will surprise anyone: the, to, is,
a, of, and, in ...
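That lopsided curve is the Zipf distribution you mentioned, and the
scheme you describe -- stoplist the top of the curve, report whatever
falls outside it -- is only a few more lines of Python. A rough,
untested sketch (corpus_counts and the 80% cutoff are placeholders
for whatever your database query would actually hand back):

    def build_stoplist(corpus_counts, coverage=0.80):
        # corpus_counts: (word, occurrences) pairs for the whole corpus
        pairs = [(c, w) for (w, c) in corpus_counts]
        pairs.sort()
        pairs.reverse()                  # most frequent first
        total = 0
        for c, w in pairs:
            total += c
        stop = {}
        seen = 0
        for c, w in pairs:
            if seen >= coverage * total: # top words cover ~80% of all hits
                break
            stop[w] = 1
            seen += c
        return stop

    def unusual_words(document_words, stoplist):
        # the "interesting" words are whatever falls outside the stoplist
        return [w for w in document_words if w not in stoplist]

The unusual_words() half could just as easily be a filter over the
output of my dump() method above.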
I don't have the urge to write a version in Perl. Not tonight
anyway.
Of course this script is free for any use you can think of.