Optimizing a text statistics function

Terry Reedy tjreedy at udel.edu
Wed Apr 21 14:48:49 EDT 2004


"Neil Benn" <benn at cenix-bioscience.com> wrote in message
news:40869EC1.8030504 at cenix-bioscience.com...
>     In other languages I've used (mainly java although some C, C# VB
> <wash out your mouth>), the way I woud look at speeding this up is to
> avoid loading all the words into memory in one go and then working upon
> them.  I'd create one stream which reads through the file, then passes
> onto a listener each word it finds from the lexing (making the input
> into tokens) and then another stream listening to this which will then
> sort out the detail from these tokens (parsing), finally an output
> stream which put this data wherever it needs to be (DB, screen, file,
> etc).  This means that the program would scale better (if you pass the
> European voting register through your system it would take exponentially
> much longer as you must scan the information twice).

You are talking about chaining iterators, which generators and the new
iterator protocol make easier than before -- intentionally.  Something like
the following (untested).

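filename = 'whatever'

def wordify(source):
   # source is any iterable of lines, such as an open file
   for line in source:
      for word in line.split():   # split() already drops surrounding whitespace
         yield word

def tabulate(words):
   counts = {}
   for word in words:
      # get() returns 0 for a word not seen before
      counts[word] = counts.get(word, 0) + 1
   for wordcount in counts.iteritems():
      yield wordcount

def disposer(wordcounts):
   for wordcount in wordcounts:
      print wordcount

# pass an open file, not the filename string, so wordify iterates over lines
disposer(tabulate(wordify(open(filename))))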
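
For readers on Python 3, the same chain of generators can be sketched as
below.  This is a minimal sketch and not part of the original post: it
assumes print() as a function and dict.items() in place of iteritems(), and
it uses io.StringIO with made-up sample text as a stand-in for an open file
so that it runs on its own.

```python
import io

def wordify(source):
    # source is any iterable of lines: an open file, a list, a StringIO.
    for line in source:
        for word in line.split():
            yield word

def tabulate(words):
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    # All words must be consumed before any total can be emitted.
    yield from counts.items()

def disposer(wordcounts):
    for wordcount in wordcounts:
        print(wordcount)

# Assumption: this sample text stands in for open(filename).
sample = io.StringIO("the cat sat\non the mat\n")
disposer(tabulate(wordify(sample)))
```

Only tabulate holds state (one dict entry per distinct word); wordify never
keeps more than one line of input at a time.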


>     However as more experienced python programmers have not suggested
> this is this because there is :
>
> a.  Something I'm not getting about python text handling

Nothing obvious.

> b. Not easy/possible in python

Wrong (see above)

c. The OP's question (speedup) was answered half an hour after posting by an
experienced Python programmer (use dict.get).  That answer makes the
processing one-pass, which in turn makes chaining possible.

d. It has been less than 5 hours since OP posted.

e. The OP did not ask how to restructure the program to make it more modular.
Thinking ahead to make code more reusable and scalable is a second-order
concern after learning basics like getting .get().  But since you brought
the subject up ...
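
To illustrate point c: the OP's code is not reproduced in this message, so
the rescanning function below is a hypothetical stand-in, included only for
contrast.  It needs the whole word list in memory and scans it again for
every distinct word.  The dict.get version reads any iterable exactly once,
which is what lets it sit inside a chain of generators.

```python
def count_rescanning(words):
    # Hypothetical slow version: requires a list and calls list.count()
    # (a full scan) once per distinct word.
    return dict((w, words.count(w)) for w in set(words))

def count_one_pass(words):
    # dict.get supplies 0 for an unseen word, so a single pass suffices.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

words = "to be or not to be".split()
# The one-pass version accepts a plain iterator; both give the same totals.
assert count_rescanning(words) == count_one_pass(iter(words))
```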

Terry J. Reedy








