[Tutor] Simple counter to determine frequencies of words in a document

Josep M. Fontana josep.m.fontana at gmail.com
Sat Nov 20 20:48:44 CET 2010


Thanks Alan, Peter and Steve,

Rather than answering each of you separately, let me use my response
to Steve's message as the basis for an answer to all of you.

It turns out that efficiency is VERY important in this case. The
example in my message was a very short string, but the file I'm
actually trying to process is pretty big (20 MB of text).

I'm writing to you as my computer is about to burst into flames. I'm
exaggerating a little bit: I'm checking the temperature, and things so
far seem to be under control. But I ran the script I put together
following your recommendations (see below) on the real file for which
I wanted word frequencies, and it has been running for over half an
hour without generating the output file yet. I'm using a pretty
powerful computer (a Core i7 with 8 GB of RAM), so I'm a little
surprised (and a bit worried as well) that the process hasn't
finished. I tested the script before with a much smaller file, and the
output was as desired.

When I look at the processes currently running on my computer, I see
the Python process pegged at 100% in the process monitor. Since my
computer has a multi-core processor, I'm assuming the process is using
only one of the cores, because another monitor tells me that overall
CPU usage is under 20% (presumably one saturated core out of several
hardware threads). This doesn't make much sense to me. I bought a
computer with a powerful CPU precisely to do these kinds of things as
fast as possible. How can it be that Python uses only such a small
fraction of the available processing power? But I digress; I'll start
another thread to ask about this, because I'm curious to know whether
it can be changed in any way. Right now, however, I'm more interested
in getting the right answer to my original question.

OK, I'll start with Steve's answer first.


> When you run that code, are you SURE that it merely results in the output
> file being blank? When I run it, I get an obvious error:
>
> Traceback (most recent call last):
>  File "<stdin>", line 4, in <module>
> TypeError: argument 1 must be string or read-only character buffer, not list
>
> Don't you get this error too?

Nope. I was surprised myself, but I did not get any errors. I suspect
that this is because my IDE isn't configured properly. Although (see
below) I do get plenty of other error messages, I didn't get any in
this case. See, I'm not only a newbie in Python but a newbie with IDEs
as well. I'm using Eclipse (I probably should have started with
something smaller and simpler), and I see the following error message:

--------------------
Pylint: Executing command line:
'/Applications/eclipse/Eclipse.app/Contents/MacOS --include-ids=y /Volumes/DATA/Documents/workspace/GCA/src/prova.py'
Pylint: The stdout of the command line is:
Pylint: The stderr of the command line is:
/usr/bin/python: can't find '__main__.py' in '/Applications/eclipse/Eclipse.app/Contents/MacOS'
--------------------

Anyway, I tried the different alternatives you all suggested on a
small test file, and everything worked perfectly. With the big file,
however, none of the alternatives seems to work. Well, I don't know
whether they work or not, because each run takes so long that I have
had to kill it out of desperation. The process I mentioned at the
beginning of this message is the one running Peter's alternative; I
think I'm going to kill it as well, because it has now been running
for 45 minutes and that seems way too long.


So, here is how I wrote the code. You'll see that there are two
different functions that do the same thing: countWords(wordlist) and
countWords2(wordlist). countWords2 is adapted from Peter Otten's
suggestion, the one that, according to him, should be more efficient.
However, none of the versions (Alan's included) finishes when the file
being processed is large.

def countWords(wordlist):
    word_table = {}
    for word in wordlist:
        # list.count() walks the entire list once for every word,
        # so this loop does roughly len(wordlist)**2 comparisons
        count = wordlist.count(word)
        word_table[word] = count
    return word_table.items()  # (word, count) pairs so writeTable can unpack them

def countWords2(wordlist):  # as proposed by Peter Otten
    word_table = {}
    for word in wordlist:
        # a single dictionary update per word
        if word in word_table:
            word_table[word] += 1
        else:
            word_table[word] = 1
        # NOTE: the next two lines are carried over from countWords above;
        # they re-scan the whole list on every iteration
        count = wordlist.count(word)
        word_table[word] = count
    return sorted(
        word_table.items(), key=lambda item: item[1], reverse=True
    )

def getWords(filename):
    # read the whole file and split it into words on whitespace
    with open(filename, 'r') as f:
        words = f.read().split()
    return words

def writeTable(filename, table):
    # write one "word<TAB>count" line per entry
    with open(filename, 'w') as f:
        for word, count in table:
            f.write("%s\t%s\n" % (word, count))

words = getWords('tokens_short.txt')
table = countWords(words)  # or table = countWords2(words)
writeTable('output.txt', table)
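
For comparison, a single-pass version of the whole pipeline might look
like the sketch below. This assumes Python 2.7 or later, where
collections.Counter is available; countWordsCounter is just an
illustrative name, and getWords/writeTable are the functions defined
above.

from collections import Counter

def countWordsCounter(wordlist):
    # Counter builds the word -> frequency table in one pass over the list;
    # most_common() returns (word, count) pairs sorted by count, descending
    return Counter(wordlist).most_common()

words = getWords('tokens_short.txt')
writeTable('output.txt', countWordsCounter(words))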



> For bonus points, you might want to think about why countWords will be so
> inefficient for large word lists, although you probably won't see any
> problems until you're dealing with thousands or tens of thousands of words.


Well, now it will be clear to you that I AM seeing big problems,
because the files I need to process contain tens of thousands of
words. The reason countWords is inefficient, I'm guessing, is that
wordlist.count(word) has to walk the entire list again for every word
in the loop, so the total work grows with the square of the number of
words. This is more or less what Peter said about the solution Alan
proposed, right?
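
To put rough numbers on that (a back-of-the-envelope sketch; the
six-bytes-per-word figure is just an assumption):

n = 20 * 1024 * 1024 // 6  # ~3.5 million words in a 20 MB file, at ~6 bytes per word
print(n * n)               # ~1.2e13 list comparisons for the count() approach,
                           # versus ~3.5e6 dictionary updates for a single pass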

However, even with countWords2, which is supposed to overcome this
problem, it feels as if I've entered an infinite loop.
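
Actually, pasting the code here makes me wonder: countWords2 as listed
still calls wordlist.count(word) on every iteration (the two lines
carried over from countWords), so maybe it never got the dictionary
speedup at all. A sketch with those two lines removed (countWords2fixed
is just an illustrative name):

def countWords2fixed(wordlist):
    # same dictionary counting as countWords2, but without the leftover
    # wordlist.count() calls -- a single pass over the list
    word_table = {}
    for word in wordlist:
        if word in word_table:
            word_table[word] += 1
        else:
            word_table[word] = 1
    return sorted(word_table.items(), key=lambda item: item[1], reverse=True)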

Josep M.




