Word frequencies -- Python or Perl for performance?

Jim Dennis jimd at vega.starshine.org
Thu Mar 21 07:03:22 EST 2002


In article <mailman.1016223990.19235.python-list at python.org>, Nick Arnett wrote:

> Anybody have any experience generating word frequencies from short documents
> with Python and Perl?  Given a choice between the two, I'm wondering what
> will be faster.  And a related question... any idea if there will be a
> significant performance hit (or advantage?) from storing the data in MySQL
> v. my own file-based data structures?

> I'll be processing a fairly large number of short (1-6K or so) documents at
> a time, so I'll be able to batch up things quite a bit.  I'm thinking that
> the database might help me avoid loading up a lot of useless data.  Since
> word frequencies follow a Zipf distribution, I'm guessing that I can spot
> unusual words (my goal here) by loading up the top 80 percent or so of words
> in the database (by occurrences) and focusing on the words that are in the
> docs but not in the set retrieved from the database.

> Thanks for any thoughts on this and pointers to helpful examples or modules.

> Nick Arnett

 I don't know what you're really trying to do, but I decided to 
 code up a quickie "word counter" for the hell of it.

 I started with one that would simply count "words"
 (whitespace-separated sequences of letters, hyphens, and
 apostrophes).  I then decided to also flag which of them were
 "known" words and keep a count of those as well.

 So, here's my very own version (in about 80 lines):


#!/usr/bin/env python2.2
import sys, string

class Wordcount:
	"""Keep a count of all unique "words" in text
	   Maintain a dictionary or words, each with a 
	   count of the number of occurences, 
	   Add arbitrary text to it, dump the dictionary on demand """
	# for words like o'clock and O'Holloran and fiddle-faddle
	# what should we do about contractions?
	# ditto possessive forms?
	tr = string.maketrans('','')
	rm = string.punctuation + string.digits
	rm = string.translate(rm, tr, "'-")  
		# Don't remove apostrophes and hyphens from "words"
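	# Class-level state: the "known words" dictionary is shared by
	# every instance and loaded from the system word list only once.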
	knownWords = {}
	knownWordsRead = 0
	
	def __init__(self):
		self.words = {}
		self.count = 0	# total words processed
		self.nword = 0	# number of words in our instance dictionary
		self.known = 0 	# number of our words found in the class dict.
		# Each instance gets its own word list and total count
		if not Wordcount.knownWordsRead:
			# We'll try to create the "known words" dictionary
			# But we only do that on first instantiation 
			# since all instances share this one dictionary
			try:
				Wordcount.knownWordsRead = 1
				wlist = open('/usr/share/dict/words','r')
				for i in wlist:
					i = i.lower().strip()
					if not i in Wordcount.knownWords:
						Wordcount.knownWords[i] = 0
			except: pass
			# but we won't try very hard
	def add (self,text):
		for each in text.split():
			word = string.translate(each, Wordcount.tr, Wordcount.rm).lower()
			word = word.strip()
			while word.endswith("'"):   word = word[:-1] # strip quotes
			while word.startswith("'"): word = word[1:]  #
			if word.startswith('-'): continue 
			if word.endswith('-'): continue 
			if word.endswith("n't"): word = word[:-3] # can't include these
			if word.endswith("'ll"): word = word[:-3] # or you'll wonder
			if word.endswith("'s"):  word = word[:-2] # who's
			if word == '-' or word == "'" or len(word) < 1 : continue 
			self.count += 1
			if not word in self.words:
				self.words[word] = 1
				self.nword += 1
			else:
				self.words[word] += 1
			if word in Wordcount.knownWords: self.known += 1
	def dump (self):
		# Return a list of (count, word) pairs, most frequent first.
		items = [ (count, word) for word, count in self.words.items() ]
		items.sort()
		items.reverse()
		return items
			
		
if __name__ == '__main__':
	wcount = Wordcount()
	for i in sys.argv[1:]:
		file = open(i,"r")
		for line in file:
			wcount.add(line)
			# handle hyphenation?
			## poss. by cutting last word IFF it ends in a hyphen
			## and prepending it to the next line -- see the sketch
			## after the script.
	print wcount.count, wcount.known, wcount.nword, \
		wcount.known/float(wcount.count), wcount.nword/float(wcount.known)
	for count, word in wcount.dump():
		if count > 1:		# Skip "unique" words
			if word in Wordcount.knownWords: word += "*"
			print "%7d %s" % (count, word)
	print wcount.count, wcount.known, wcount.nword, \
		wcount.known/float(wcount.count), wcount.nword/float(wcount.known)


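 About the hyphenation comment in the main loop: here's a minimal
 sketch of one way to do it, holding back a trailing hyphenated
 fragment and prepending it to the next line.  The add_file()
 wrapper is my own naming, not part of the class above:

	def add_file(wc, f):
		# re-join words hyphenated across lines ("exam-" + "ple")
		carry = ''
		for line in f:
			line = carry + line.lstrip()
			carry = ''
			words = line.split()
			if words and words[-1].endswith('-'):
				carry = words.pop()[:-1]	# hold the fragment
			wc.add(' '.join(words))
		if carry:
			wc.add(carry)	# file ended on a hyphenated fragment

 The main loop would then call add_file(wcount, f) instead of
 feeding wcount.add() line by line.  (Caveat: this also glues
 together legitimately hyphenated words like fiddle-faddle when
 they happen to break at end of line.)
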
 It's a bit crude, particularly in handling apostrophes (vs.
 single quotes) and hyphens/dashes.  However, it seems to work
 okay.  I made NO effort to optimize it.  (If I were concerned
 about speed, I'd at least use for line in file.readlines():
 rather than iterating line by line; a sketch of that change
 follows.)
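
 Something like this (untimed, so any speedup is a guess on my
 part; wcount.add(f.read()) would also work, since add() splits
 on any whitespace):

	for i in sys.argv[1:]:
		f = open(i, "r")
		for line in f.readlines():	# slurp the whole file at once
			wcount.add(line)
		f.close()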

 I tested by running the following commands:

 	for i in /bin/* /usr/bin/*; do 
		bname=$(basename $i); man $bname | col -b > /tmp/$bname.man
		done

	time ./wordcount.py /tmp/*.man | head

 Here's the output from that (the traceback at the end is just
 head(1) closing the pipe after ten lines; it doesn't affect the
 timing):

1602048 1361723 36978 0.849988889222 0.0271553025101
 117960 the*
  41673 to*
  36275 is*
  34975 a
  32191 of*
  27045 and*
  22881 in*
  20336 for*
  17571 be*
Traceback (most recent call last):
  File "./wordcount.py", line 81, in ?
    print "%7d %s" % (count, word)
IOError: [Errno 32] Broken pipe

real    1m48.212s
user    1m47.950s
sys     0m0.250s
$ ls /tmp/*.man | ./wc.py 
   1761    1761   31804 
$ du /tmp/*.man 
....
15836   total
$ find /tmp/*.man -printf "%s\n" \
	| awk '{n++; t+=$1}; END { print t/n ; }'
7104.33

 ... so it handled over 1700 medium-sized files (averaging about
 7K each, roughly 12 MB in all) in less than two minutes.  Of the
 words I counted, about 85% were "known" words from
 /usr/share/dict/words, and the unique words amount to about 2.7%
 of the known-word occurrences.  (In other words, the Linux man
 pages draw on only a small slice of the English vocabulary.)  I
 doubt the top words on the list will surprise anyone: the, to,
 is, a, of, and, in ...
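
 Since Nick's actual goal is spotting the unusual words, that
 falls out of the same data.  A minimal sketch (the name
 unusual_words() and the cutoff of 2 are my own choices, nothing
 measured):

	def unusual_words(wc, cutoff=2):
		# words in the documents but NOT in the shared "known
		# words" dictionary -- candidates for "unusual"
		hits = []
		for count, word in wc.dump():
			if count >= cutoff and word not in Wordcount.knownWords:
				hits.append((count, word))
		return hits

 Per his Zipf observation, the "known" set could just as well be
 the top 80 percent of words by occurrence pulled from MySQL
 instead of /usr/share/dict/words; the shape of the loop doesn't
 change.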

 I don't have the urge to write a version in Perl.  Not tonight
 anyway.

 Of course this script is free for any use you can think of.



