[Tutor] Getting total counts

Sat Oct 2 03:13:04 CEST 2010

On Sat, 2 Oct 2010 06:31:42 am aeneas24 at priest.com wrote:
> Hi,
>
> I have created a csv file that lists how often each word in the
> Internet Movie Database occurs with different star-ratings and in
> different genres.

I would have thought that IMDB would probably have already made that 
information available?

http://www.imdb.com/interfaces

> The input file looks something like this--since 
> movies can have multiple genres, there are three genre rows. (This is
> fake, simplified data.)
[...]
> I can get the program to tell me how many occurrences of "the" there
> are in Thrillers (50), how many "the"'s in 1-stars (50), and how many
> 1-star drama "the"'s there are (30). But I need to be able to expand
> beyond a particular word and say "how many words total are in
> "Drama"? How many total words are in 1-star ratings? How many words
> are there in the whole corpus? On these all-word totals, I'm stumped.

The headings of your data look like this:

ID | Genre1 | Genre2 | Genre3 | Star-rating | Word | Count

and you want to map words to genres. Can you tell us how big the CSV 
file is? Depending on its size, you may need to use on-disk storage 
(perhaps shelve, as you're already doing) but for illustration purposes 
I'll assume it all fits in memory and just use regular dicts. I'm going 
to create a table that stores the counts for each word versus the 
genre:

Genre    | the | scary | silly | exciting | ... 
------------------------------------------------
Western  | 934 |   3   |   5   |    256   |
Thriller | 899 |  145  |   84  |    732   |
Comedy   | 523 |   1   |  672  |     47   |
...

To do this using dicts, I'm going to use a dict for genres:

genre_table = {"Western": table_of_words, ...}

and each table_of_words will look like:

{'the': 934, 'scary': 3, 'silly': 5, ...}
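As a quick sketch of that layout, here is the nested dict written out by hand with the numbers from the table above, plus two lookups. The variable names scary_in_thrillers and western_words are only for illustration:

```python
# Nested dict: genre -> {word: count}. The numbers are copied from the
# example table above.
genre_table = {
    "Western": {"the": 934, "scary": 3, "silly": 5, "exciting": 256},
    "Thriller": {"the": 899, "scary": 145, "silly": 84, "exciting": 732},
    "Comedy": {"the": 523, "scary": 1, "silly": 672, "exciting": 47},
}

# One cell of the table: how often "scary" occurs in Thrillers.
scary_in_thrillers = genre_table["Thriller"]["scary"]

# One whole row of the table: every word count for Westerns.
western_words = genre_table["Western"]

print(scary_in_thrillers)  # 145
```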

Let's start with a helper function and table to store the data.

# Initialise the table.
genres = {}

def add_word(genre, word, count):
    genre = genre.title()  # force "gEnRe" to "Genre"
    word = word.lower()  # force "wOrD" to "word"
    count = int(count)
    row = genres.get(genre, {})
    n = row.get(word, 0)
    row[word] = n + count
    genres[genre] = row
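To see what the helper does, here is a small self-contained run. The three calls are made-up inputs, chosen only to show the case normalisation and the way repeated words accumulate:

```python
genres = {}

def add_word(genre, word, count):
    genre = genre.title()  # force "gEnRe" to "Genre"
    word = word.lower()  # force "wOrD" to "word"
    count = int(count)
    row = genres.get(genre, {})
    n = row.get(word, 0)
    row[word] = n + count
    genres[genre] = row

# Made-up calls, for illustration only.
add_word("western", "The", "10")  # count arrives as a string, as from a CSV
add_word("WESTERN", "the", 5)     # same genre and word, different case
add_word("comedy", "silly", 2)

print(genres)  # {'Western': {'the': 15}, 'Comedy': {'silly': 2}}
```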

We can simplify this code by using the confusingly named, but useful, 
setdefault method of dicts:

def add_word(genre, word, count):
    genre = genre.title()
    word = word.lower()
    count = int(count)
    row = genres.setdefault(genre, {})
    row[word] = row.get(word, 0) + count
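If setdefault is unfamiliar, this short sketch shows the behaviour the helper relies on: it returns the stored value when the key exists, and otherwise stores the default and returns that. The name table is just a throwaway dict for the demonstration:

```python
table = {}

# Key missing: setdefault stores the default ({}) and returns it.
row = table.setdefault("Western", {})
row["the"] = 1

# Key present: setdefault ignores the new default and returns the
# dict that is already stored, so earlier counts are not lost.
same_row = table.setdefault("Western", {})

print(same_row is row)  # True
print(table)            # {'Western': {'the': 1}}
```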

Now let's process the CSV file. Here is a sketch using the csv module's 
DictReader. It assumes the file is comma-separated with a header row 
matching the headings above; the filename is a placeholder:

import csv

with open("words.csv", newline="") as f:  # placeholder filename
    for row in csv.DictReader(f):
        word = row["Word"]
        count = row["Count"]
        for column in ("Genre1", "Genre2", "Genre3"):
            genre = row[column].strip()
            if genre:  # skip genre columns left empty
                add_word(genre, word, count)

Now we can easily query our table for useful information:

# list of unique words for the Western genre
list(genres["Western"])
# count of unique words for the Romance genre
len(genres["Romance"])
# number of times "underdog" is used in Sport movies
genres["Sport"]["underdog"]
# total count of words for the Comedy genre
sum(genres["Comedy"].values())
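One caution about the all-word totals you asked for: adding the per-genre totals together counts a word once for every genre its movie belongs to, so it overstates the size of the corpus. A minimal sketch of keeping separate running totals instead, assuming rows in the column layout shown earlier. The sample rows are made up, and collections.Counter stands in here for the plain dicts used above:

```python
from collections import Counter

# Made-up rows in the column layout shown earlier; not real IMDB data.
rows = [
    {"Genre1": "Drama", "Genre2": "Thriller", "Genre3": "",
     "Star-rating": "1", "Word": "the", "Count": "30"},
    {"Genre1": "Thriller", "Genre2": "", "Genre3": "",
     "Star-rating": "1", "Word": "the", "Count": "20"},
    {"Genre1": "Drama", "Genre2": "", "Genre3": "",
     "Star-rating": "2", "Word": "scary", "Count": "4"},
]

genre_totals = Counter()  # total words per genre
star_totals = Counter()   # total words per star rating
corpus_total = 0          # total words overall, each row counted once

for row in rows:
    count = int(row["Count"])
    corpus_total += count
    star_totals[row["Star-rating"]] += count
    for column in ("Genre1", "Genre2", "Genre3"):
        genre = row[column].strip().title()
        if genre:  # skip genre columns left empty
            genre_totals[genre] += count

print(genre_totals["Drama"])       # 34
print(star_totals["1"])            # 50
print(corpus_total)                # 54
print(sum(genre_totals.values()))  # 84, larger because of double counting
```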

Do you want to do lookups efficiently the other way as well? It's easy 
to add another table:

Word  | Western | Thriller | ... 
------------------------------------------------
the   |   934   |   899    |
scary |    3    |   145    |
...

Add a second global table:

genres = {}
words = {}

and modify the helper function:

def add_word(genre, word, count):
    genre = genre.title()
    word = word.lower()
    count = int(count)
    # Add word to the genres table.
    row = genres.setdefault(genre, {})
    row[word] = row.get(word, 0) + count
    # And to the words table.
    row = words.setdefault(word, {})
    row[genre] = row.get(genre, 0) + count
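With both tables filled in, lookups by word are as direct as lookups by genre. A small self-contained sketch, loading a few counts from the example tables above; the names scary_by_genre and top_genre are only for illustration:

```python
genres = {}
words = {}

def add_word(genre, word, count):
    genre = genre.title()
    word = word.lower()
    count = int(count)
    # Add word to the genres table.
    row = genres.setdefault(genre, {})
    row[word] = row.get(word, 0) + count
    # And to the words table.
    row = words.setdefault(word, {})
    row[genre] = row.get(genre, 0) + count

# Counts taken from the example tables above.
add_word("Western", "the", 934)
add_word("Thriller", "the", 899)
add_word("Western", "scary", 3)
add_word("Thriller", "scary", 145)

# How often is "scary" used in each genre?
scary_by_genre = words["scary"]

# The genre in which "scary" is most frequent.
top_genre = max(scary_by_genre, key=scary_by_genre.get)

print(scary_by_genre)  # {'Western': 3, 'Thriller': 145}
print(top_genre)       # Thriller
```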

-- 
Steven D'Aprano