Optimizing a text statistics function
Terry Reedy
tjreedy at udel.edu
Wed Apr 21 14:48:49 EDT 2004
"Neil Benn" <benn at cenix-bioscience.com> wrote in message
news:40869EC1.8030504 at cenix-bioscience.com...
> In other languages I've used (mainly java although some C, C# VB
> <wash out your mouth>), the way I would look at speeding this up is to
> avoid loading all the words into memory in one go and then working upon
> them. I'd create one stream which reads through the file, then passes
> onto a listener each word it finds from the lexing (making the input
> into tokens) and then another stream listening to this which will then
> sort out the detail from these tokens (parsing), finally an output
> stream which put this data wherever it needs to be (DB, screen, file,
> etc). This means that the program would scale better (if you pass the
> European voting register through your system it would take exponentially
> longer, as you must scan the information twice).
You are talking about chaining iterators, which generators and the new
iterator protocol make easier than before -- intentionally. Something like
the following (untested):
filename = 'whatever'

def wordify(source):
    for line in source:
        for word in line.split():
            yield word.strip()

def tabulate(words):
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    for wordcount in counts.iteritems():
        yield wordcount

def disposer(wordcounts):
    for wordcount in wordcounts:
        print wordcount

disposer(tabulate(wordify(open(filename))))
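For readers trying this in Python 3, the same chain works with iteritems()
replaced by items() and print used as a function. A minimal sketch, with a
hypothetical in-memory list of lines standing in for the file, and the
disposer returning a dict so the result can be inspected:

```python
def wordify(source):
    # Lazily yield one word at a time from an iterable of lines.
    for line in source:
        for word in line.split():
            yield word.strip()

def tabulate(words):
    # One pass: dict.get supplies 0 for words not yet seen.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    # Yield (word, count) pairs downstream.
    for wordcount in counts.items():
        yield wordcount

def disposer(wordcounts):
    # Final consumer; here it collects rather than prints.
    return dict(wordcounts)

# Hypothetical sample input (the original reads from a file).
lines = ["spam eggs spam", "eggs spam"]
result = disposer(tabulate(wordify(lines)))
# result == {'spam': 3, 'eggs': 2}
```

Each stage pulls from the one before it, so no stage ever holds the whole
word list in memory -- the listener-chain the poster described.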
> However, as more experienced python programmers have not suggested
> this, is this because there is:
>
> a. Something I'm not getting about python text handling
Nothing obvious.
> b. Not easy/possible in python
Wrong (see above).
c. The OP's question (speedup) was answered a half hour after posting by an
experienced P. programmer (use dict.get) -- which answer makes the
processing one-pass, which in turn makes chaining possible.
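The dict.get idiom counts each word the moment it is seen, so the data is
scanned only once. A minimal illustration (the OP's original code is not
shown here, so the sample words are made up):

```python
counts = {}
for word in ["to", "be", "or", "not", "to", "be"]:
    # dict.get supplies a default of 0 for unseen words, so no
    # separate membership test or second pass is needed.
    counts[word] = counts.get(word, 0) + 1
# counts == {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```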
d. It has been less than 5 hours since OP posted.
e. The OP did not ask how to restructure program to make it more modular.
Thinking ahead to make code more reusable and scalable is a second-order
concern after learning basics like getting .get(). But since you brought
the subject up ...
Terry J. Reedy
More information about the Python-list mailing list