[Tutor] Getting total counts
Steven D'Aprano
steve at pearwood.info
Sat Oct 2 03:13:04 CEST 2010
On Sat, 2 Oct 2010 06:31:42 am aeneas24 at priest.com wrote:
> Hi,
>
> I have created a csv file that lists how often each word in the
> Internet Movie Database occurs with different star-ratings and in
> different genres.
I would have thought that IMDB would probably have already made that
information available?
http://www.imdb.com/interfaces
> The input file looks something like this--since
> movies can have multiple genres, there are three genre rows. (This is
> fake, simplified data.)
[...]
> I can get the program to tell me how many occurrence of "the" there
> are in Thrillers (50), how many "the"'s in 1-stars (50), and how many
> 1-star drama "the"'s there are (30). But I need to be able to expand
> beyond a particular word and say "how many words total are in
> "Drama"? How many total words are in 1-star ratings? How many words
> are there in the whole corpus? On these all-word totals, I'm stumped.
The headings of your data look like this:
ID | Genre1 | Genre2 | Genre3 | Star-rating | Word | Count
and you want to map words to genres. Can you tell us how big the CSV
file is? Depending on its size, you may need to use on-disk storage
(perhaps shelve, as you're already doing) but for illustration purposes
I'll assume it all fits in memory and just use regular dicts. I'm going
to create a table that stores the counts for each word versus the
genre:
Genre    | the | scary | silly | exciting | ...
------------------------------------------------
Western  | 934 |     3 |     5 |      256 |
Thriller | 899 |   145 |    84 |      732 |
Comedy   | 523 |     1 |   672 |       47 |
...
To do this using dicts, I'm going to use a dict for genres:
genre_table = {"Western": table_of_words, ...}
and each table_of_words will look like:
{'the': 934, 'scary': 3, 'silly': 5, ...}
Let's start with a helper function and table to store the data.
# Initialise the table.
genres = {}

def add_word(genre, word, count):
    genre = genre.title()  # force "gEnRe" to "Genre"
    word = word.lower()  # force "wOrD" to "word"
    count = int(count)
    row = genres.get(genre, {})
    n = row.get(word, 0)
    row[word] = n + count
    genres[genre] = row
We can simplify this code by using the confusingly named, but useful,
setdefault method of dicts:
def add_word(genre, word, count):
    genre = genre.title()
    word = word.lower()
    count = int(count)
    row = genres.setdefault(genre, {})
    row[word] = row.get(word, 0) + count
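
To see what setdefault buys us, here is a small run of the helper. The
words and counts are made up for illustration, not real IMDB data:

```python
genres = {}

def add_word(genre, word, count):
    genre = genre.title()
    word = word.lower()
    count = int(count)
    # setdefault returns the existing row for this genre, or stores
    # and returns a new empty dict if the genre is not there yet.
    row = genres.setdefault(genre, {})
    row[word] = row.get(word, 0) + count

# Made-up sample values. Counts may arrive as strings or ints.
add_word("western", "The", "900")
add_word("WESTERN", "the", 34)
add_word("thriller", "scary", "145")

print(genres)
# {'Western': {'the': 934}, 'Thriller': {'scary': 145}}
```

Because setdefault hands back the dict that is stored in the table,
updating row updates the table too, so the final
"genres[genre] = row" line is no longer needed.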
Now let's process the CSV file. This sketch uses the csv module and
assumes Python 3, a comma-delimited file with a header row naming the
columns as above, and a placeholder filename "words.csv":

import csv

with open("words.csv", newline="") as f:
    for row in csv.DictReader(f):
        word = row["Word"]
        count = row["Count"]
        # Assumption: a movie with fewer than three genres leaves
        # the unused genre columns blank, so skip those.
        for column in ("Genre1", "Genre2", "Genre3"):
            if row[column]:
                add_word(row[column], word, count)
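
To check the loading step on a few rows before running it on the real
file, you can feed csv.DictReader an in-memory string instead of a file.
This is only a sketch: the rows are made up, and I am assuming a
comma-delimited file whose header uses the column names quoted earlier
and whose unused genre columns are blank.

```python
import csv
import io

genres = {}

def add_word(genre, word, count):
    row = genres.setdefault(genre.title(), {})
    word = word.lower()
    row[word] = row.get(word, 0) + int(count)

# Made-up rows, not real IMDB data.
sample = """\
ID,Genre1,Genre2,Genre3,Star-rating,Word,Count
1,Drama,Thriller,,1,the,30
2,Thriller,,,1,the,20
3,Comedy,Drama,,4,silly,7
"""

for row in csv.DictReader(io.StringIO(sample)):
    for column in ("Genre1", "Genre2", "Genre3"):
        genre = row[column]
        if genre:  # assumption: unused genre columns are left blank
            add_word(genre, row["Word"], row["Count"])

print(genres)
# {'Drama': {'the': 30, 'silly': 7}, 'Thriller': {'the': 50}, 'Comedy': {'silly': 7}}
```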
Now we can easily query our table for useful information:
# list of unique words for the Western genre
genres["Western"].keys()
# count of unique words for the Romance genre
len(genres["Romance"])
# number of times "underdog" is used in Sports movies
genres["Sport"]["underdog"]
# total count of words for the Comedy genre
sum(genres["Comedy"].values())
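
The same idea gives the total for every genre at once. The table below
is a hypothetical hand-built sample in the same shape as genres, using
a few of the illustrative numbers from the table earlier:

```python
# Hypothetical sample in the same shape as the genres table.
genres = {
    "Comedy": {"the": 523, "silly": 672},
    "Western": {"the": 934, "scary": 3},
}

# Total word count for each genre.
totals = {genre: sum(table.values()) for genre, table in genres.items()}
print(totals)  # {'Comedy': 1195, 'Western': 937}

# Indexing with [] raises KeyError if the word never occurs in that
# genre; .get() with a default of 0 avoids that.
print(genres["Comedy"].get("underdog", 0))  # 0

# Adding up the genre totals gives 2132 here.
grand_total = sum(totals.values())
print(grand_total)  # 2132
```

One caution about that last figure: because a movie can be listed under
up to three genres, its words are added to each of those genres, so the
sum of the genre totals can be larger than the number of words in the
corpus. For a true corpus total, add up the Count column once per CSV
row instead.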
Do you want to do lookups efficiently the other way as well? It's easy
to add another table:
Word  | Western | Thriller | ...
------------------------------------------------
the   |     934 |      899 |
scary |       3 |      145 |
...
Add a second global table:
genres = {}
words = {}
and modify the helper function:
def add_word(genre, word, count):
    genre = genre.title()
    word = word.lower()
    count = int(count)
    # Add word to the genres table.
    row = genres.setdefault(genre, {})
    row[word] = row.get(word, 0) + count
    # And to the words table.
    row = words.setdefault(word, {})
    row[genre] = row.get(genre, 0) + count
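
With both tables filled in, lookups by word are as direct as lookups by
genre. A short sketch, again with made-up counts:

```python
genres = {}
words = {}

def add_word(genre, word, count):
    genre = genre.title()
    word = word.lower()
    count = int(count)
    # Add word to the genres table.
    row = genres.setdefault(genre, {})
    row[word] = row.get(word, 0) + count
    # And to the words table.
    row = words.setdefault(word, {})
    row[genre] = row.get(genre, 0) + count

# Made-up sample values.
add_word("Western", "the", 934)
add_word("Thriller", "the", 899)
add_word("Thriller", "scary", 145)

# How often "the" occurs in each genre.
print(words["the"])  # {'Western': 934, 'Thriller': 899}

# Total occurrences of "the" summed over the genres.
print(sum(words["the"].values()))  # 1833

# Which genres use "scary" at all.
print(sorted(words["scary"]))  # ['Thriller']

# Both tables hold the same count, just indexed the other way round.
print(genres["Thriller"]["the"] == words["the"]["Thriller"])  # True
```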
--
Steven D'Aprano