[Tutor] Getting total counts (Steven D'Aprano)

aeneas24 at priest.com aeneas24 at priest.com
Sat Oct 2 23:29:29 CEST 2010



Thanks very much for the extensive comments, Steve. I can get the code you wrote to work on my toy data, but my real input data is actually contained in 10 files that are about 1.5 GB each--when I try to run the code on one of those files, everything freezes. 

To solve this, I tried just having the data write to a different csv file:

lines = csv.reader(open(src_filename))
csv_writer = csv.writer(open(output_filename, 'w'))
for line in lines:
    doc, g1, g2, g3, rating, ratingmax, reviewer, helpful, h_total, word, count = line
    # add_word() updates its table in place and returns None, so each row
    # written here is [None, None, None].
    row = [add_word(g1, word, count), add_word(g2, word, count), add_word(g3, word, count)]
    csv_writer.writerow(row) 


This doesn't work--I think there are problems in how the iterations happen. But my guess is that converting from one CSV to another isn't going to be as efficient as creating a shelve database. I have some code that works to create a db when I run it on a small subset of my data, but when I try to turn one of the 1.5 GB files into a db, it can't do it. I don't understand why it works for small data and not big (it makes sense to me that your table approach might choke on big amounts of data--but why does the shelve code below?)

I think these are the big things I'm trying to get the code to do:
- Get my giant CSV files into a useful format, probably a db (can do for small amounts of data, but not large)
- Extract genre and star-rating information about particular words from the db (I seem to be able to do this)
- Get total counts for all words in each genre, and for all words in each star-rating category (your table approach works on small data, but I can't get it to scale)

def csv2shelve(src_filename, shelve_filename):
    # Remove any existing shelve file before writing.
    if os.path.exists(shelve_filename):
        os.remove(shelve_filename)
    # Create the shelve db.
    db = shelve.open(shelve_filename, writeback=True) # The writeback stuff is a little confusing in the help pages, maybe this is a problem?
    # Open the src file.
    lines = csv.reader(open(src_filename))
    for line in lines:
        doc, g1, g2, g3, rating, word, count = line
        if word not in db:
            db[word] = []
        # Convert the rating to an int where possible.
        try:
            rating = int(rating)
        except ValueError:
            pass
        db[word].append({
            "genres": {g1: True, g2: True, g3: True},
            "rating": rating,
            "count": int(count)
            })
    db.close()
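One likely culprit is writeback=True: shelve then keeps every entry it has touched in an in-memory cache until sync() or close(), which grows without bound on a 1.5 GB input. Below is a minimal sketch of the same loop without writeback, using the explicit read-modify-write pattern instead (the _nowriteback name is just for illustration; the column layout is the one from the code above):

```python
import csv
import os
import shelve

def csv2shelve_nowriteback(src_filename, shelve_filename):
    if os.path.exists(shelve_filename):
        os.remove(shelve_filename)
    db = shelve.open(shelve_filename)  # no writeback: no in-memory cache
    for line in csv.reader(open(src_filename)):
        doc, g1, g2, g3, rating, word, count = line
        try:
            rating = int(rating)
        except ValueError:
            pass
        # Without writeback, mutations must be assigned back explicitly.
        entries = db.get(word, [])
        entries.append({
            "genres": {g1: True, g2: True, g3: True},
            "rating": rating,
            "count": int(count),
        })
        db[word] = entries
    db.close()
```

This trades some speed (every word triggers a read and a write) for bounded memory use.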

Thanks again, Steve. (And everyone/anyone else.)

Tyler



-----Original Message-----
From: tutor-request at python.org
To: tutor at python.org
Sent: Sat, Oct 2, 2010 1:36 am
Subject: Tutor Digest, Vol 80, Issue 10


Send Tutor mailing list submissions to
   tutor at python.org
To subscribe or unsubscribe via the World Wide Web, visit
   http://mail.python.org/mailman/listinfo/tutor
or, via email, send a message with subject or body 'help' to
   tutor-request at python.org
You can reach the person managing the list at
   tutor-owner at python.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Tutor digest..."

Today's Topics:
   1. Re: (de)serialization questions (Lee Harr)
   2. Re: regexp: a bit lost (Steven D'Aprano)
   3. Re: regexp: a bit lost (Alex Hall)
   4. Re: (de)serialization questions (Alan Gauld)
   5. Re: Getting total counts (Steven D'Aprano)
   6. data question (Roelof Wobben)

---------------------------------------------------------------------
Message: 1
Date: Sat, 2 Oct 2010 03:26:21 +0430
From: Lee Harr <missive at hotmail.com>
To: <tutor at python.org>
Subject: Re: [Tutor] (de)serialization questions
Message-ID: <SNT106-W199ADD4FDC9DA1C977F89CB1690 at phx.gbl>
Content-Type: text/plain; charset="windows-1256"

>> I have data about zip codes, street and city names (and perhaps later also of
>> street numbers). I made a dictionary of the form {zipcode: (street, city)}
>
> One dictionary with all of the data?
>
> That does not seem like it will work. What happens when
> 2 addresses have the same zip code?
You did not answer this question.
Did you think about it?

 Maybe my main question is as follows: what permanent object is most suitable to
 store a large amount of entries (maybe too many to fit into the computer's
 memory), which can be looked up very fast.
One thing about Python is that you don't normally need to
think about how your objects are stored (memory management).
It's an advantage in the normal case -- you just use the most
convenient object, and if it's fast enough and small enough
you're good to go.
Of course, that means that if it is not fast enough, or not
small enough, then you've got a bit more work to do.

 Eventually, I want to create two objects:
 1-one to look up street name and city using zip code
So... you want to have a function like:
def addresses_by_zip(zipcode):
    '''returns list of all addresses in the given zipcode'''
    ....

 2-one to look up zip code using street name, apartment number and city
and another one like:
def zip_by_address(street_name, apt, city):
    '''returns the zipcode for the given street name, apartment, and city'''
    ....

To me, it sounds like a job for a database (at least behind the scenes),
but you could try just creating a custom Python object that holds
these things:
class Address(object):
    street_number = '345'
    street_name = 'Main St'
    apt = 'B'
    city = 'Springfield'
    zipcode = '99999'
Then create another object that holds a collection of these addresses
and has methods addresses_by_zip(self, zipcode) and
zip_by_address(self, street_number, street_name, apt, city)
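A sketch of that collection object (the class bodies and method internals are illustrative guesses, using a plain linear search):

```python
class Address(object):
    def __init__(self, street_number, street_name, apt, city, zipcode):
        self.street_number = street_number
        self.street_name = street_name
        self.apt = apt
        self.city = city
        self.zipcode = zipcode

class AddressBook(object):
    def __init__(self):
        self._addresses = []

    def add(self, address):
        self._addresses.append(address)

    def addresses_by_zip(self, zipcode):
        '''returns list of all addresses in the given zipcode'''
        return [a for a in self._addresses if a.zipcode == zipcode]

    def zip_by_address(self, street_number, street_name, apt, city):
        '''returns the zipcode for the given address, or None'''
        for a in self._addresses:
            key = (a.street_number, a.street_name, a.apt, a.city)
            if key == (street_number, street_name, apt, city):
                return a.zipcode
        return None
```

The linear scans here are exactly the slow part that a database index would replace.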

 I stored object1 in a marshalled dictionary. Its length is about 450.000 (I live
 in Holland, not THAT many streets). Look-ups are incredibly fast (it has to,
 because it's part of an autocompletion feature of a data entry program). I
 haven't got the street number data needed for object2 yet, but it's going to be
 much larger. Many streets have different zip codes for odd or even numbers, or
 the zip codes are divided into street number ranges (for long streets).
Remember that you don't want to try to optimize too soon.
Build a simple working system and see what happens. If it
is too slow or takes up too much memory, fix it.

 You suggest to simply use a file. I like simple solutions, but doesn't that, by
 definition, require a slow, linear search?
You could create an index, but then any database will already have
an indexing function built in.
I'm not saying that rolling your own custom database is a bad idea,
but if you are trying to get some work done (and not just playing around
and learning Python) then it's probably better to use something that is
already proven to work.

If you have some code you are trying out, but are not sure you
are going the right way, post it and let people take a look at it.
                      

------------------------------
Message: 2
Date: Sat, 2 Oct 2010 10:19:21 +1000
From: Steven D'Aprano <steve at pearwood.info>
To: Python Tutor <Tutor at python.org>
Subject: Re: [Tutor] regexp: a bit lost
Message-ID: <201010021019.21909.steve at pearwood.info>
Content-Type: text/plain;  charset="iso-8859-1"
On Sat, 2 Oct 2010 01:14:27 am Alex Hall wrote:
 >> Here is my test:
 >> s=re.search(r"[\d+\s+\d+\s+\d]", l)
 >
 > Try this instead:
 >
 > re.search(r'\d+\s+\D*\d+\s+\d', l)
[...]
 Understood. My intent was to ask why my regexp would match anything
 at all.
Square brackets create a character set, so your regex tests for a string 
that contains a single character matching a digit (\d), a plus sign (+) 
or a whitespace character (\s). The additional \d + \s in the square 
brackets are redundant and don't add anything.
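To make the difference concrete, a quick sketch (the test strings are invented):

```python
import re

line = "abc def"  # no digits at all

# The bracketed pattern is a character class: it matches any ONE character
# that is a digit, a plus sign, or whitespace, so even this digit-free
# line matches (the space matches \s).
m1 = re.search(r"[\d+\s+\d+\s+\d]", line)

# The unbracketed pattern really does require digit runs separated by
# whitespace, so it fails on "abc def" but matches "12  34 5".
m2 = re.search(r"\d+\s+\D*\d+\s+\d", line)
m3 = re.search(r"\d+\s+\D*\d+\s+\d", "12  34 5")
```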
-- 
Steven D'Aprano

------------------------------
Message: 3
Date: Fri, 1 Oct 2010 20:47:29 -0400
From: Alex Hall <mehgcap at gmail.com>
To: "Steven D'Aprano" <steve at pearwood.info>
Cc: Python Tutor <Tutor at python.org>
Subject: Re: [Tutor] regexp: a bit lost
Message-ID:
   <AANLkTin=baJcU0E8py46gjZKmur8MTSEMBZC6=M8sPuR at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
On 10/1/10, Steven D'Aprano <steve at pearwood.info> wrote:
 On Sat, 2 Oct 2010 01:14:27 am Alex Hall wrote:
> >> Here is my test:
> >> s=re.search(r"[\d+\s+\d+\s+\d]", l)
> >
> > Try this instead:
> >
> > re.search(r'\d+\s+\D*\d+\s+\d', l)
 [...]
> Understood. My intent was to ask why my regexp would match anything
> at all.

 Square brackets create a character set, so your regex tests for a string
 that contains a single character matching a digit (\d), a plus sign (+)
 or a whitespace character (\s). The additional \d + \s in the square
 brackets are redundant and don't add anything.
Ah, that explains it then. :) Thanks.

 --
 Steven D'Aprano
 _______________________________________________
 Tutor maillist  -  Tutor at python.org
 To unsubscribe or change subscription options:
 http://mail.python.org/mailman/listinfo/tutor


-- 
Have a great day,
Alex (msg sent from GMail website)
mehgcap at gmail.com; http://www.facebook.com/mehgcap

------------------------------
Message: 4
Date: Sat, 2 Oct 2010 02:01:40 +0100
From: "Alan Gauld" <alan.gauld at btinternet.com>
To: tutor at python.org
Subject: Re: [Tutor] (de)serialization questions
Message-ID: <i8609s$l4a$1 at dough.gmane.org>
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
   reply-type=original

"Albert-Jan Roskam" <fomcl at yahoo.com> wrote:
> Maybe my main question is as follows: what permanent object is most
> suitable to store a large amount of entries (maybe too many to fit into
> the computer's memory), which can be looked up very fast.
It depends on the nature of the object and the lookup but in general
a database would be the best solution. For special (hierarchical)
data an LDAP directory may be more appropriate.
Otherwise you are looking at a custom designed file structure.
> Eventually, I want to create two objects:
> 1-one to look up street name and city using zip code
> 2-one to look up zip code using street name, apartment number and city
For this a simple relational database would be best.
SQLite should do and is part of the standard library.
It can also be used in memory for faster speed with smaller data sets.
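A minimal in-memory sqlite3 sketch of those two lookups (the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # pass a filename instead for a persistent db
conn.execute("""CREATE TABLE addresses (
    street_number TEXT, street_name TEXT, apt TEXT,
    city TEXT, zipcode TEXT)""")
conn.execute("CREATE INDEX idx_zip ON addresses (zipcode)")  # fast zip lookups

conn.execute("INSERT INTO addresses VALUES (?, ?, ?, ?, ?)",
             ("345", "Main St", "B", "Springfield", "99999"))

# 1 - street name and city by zip code
streets = conn.execute(
    "SELECT street_name, city FROM addresses WHERE zipcode = ?",
    ("99999",)).fetchall()

# 2 - zip code by street name, apartment number and city
zipcode = conn.execute(
    "SELECT zipcode FROM addresses WHERE street_name = ? AND apt = ? AND city = ?",
    ("Main St", "B", "Springfield")).fetchone()
```

The index on zipcode is what replaces the linear search a flat file would need.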
> You suggest to simply use a file. I like simple solutions, but
> doesn't that, by definition, require a slow, linear search?
No, you can use random access provided you can relate the key to the
location - that's what databases do for you under the covers.
> Funny you should mention sqlite: I was just considering it
> yesterday. Gosh, Python has so much interesting stuff to offer!
SQLite operating in-memory would be a good solution for you I think.
You can get a basic tutorial on SQLite and Python in the databases topic
of my tutorial...
HTH,

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/


------------------------------
Message: 5
Date: Sat, 2 Oct 2010 11:13:04 +1000
From: Steven D'Aprano <steve at pearwood.info>
To: tutor at python.org
Subject: Re: [Tutor] Getting total counts
Message-ID: <201010021113.04679.steve at pearwood.info>
Content-Type: text/plain;  charset="utf-8"
On Sat, 2 Oct 2010 06:31:42 am aeneas24 at priest.com wrote:
 Hi,

 I have created a csv file that lists how often each word in the
 Internet Movie Database occurs with different star-ratings and in
 different genres.
I would have thought that IMDB would probably have already made that 
information available?
http://www.imdb.com/interfaces

 The input file looks something like this--since 
 movies can have multiple genres, there are three genre rows. (This is
 fake, simplified data.)
[...]
 I can get the program to tell me how many occurrences of "the" there
 are in Thrillers (50), how many "the"'s in 1-stars (50), and how many
 1-star drama "the"'s there are (30). But I need to be able to expand
 beyond a particular word and say "how many words total are in
 "Drama"? How many total words are in 1-star ratings? How many words
 are there in the whole corpus? On these all-word totals, I'm stumped.
The headings of your data look like this:
ID | Genre1 | Genre2 | Genre3 | Star-rating | Word | Count
and you want to map words to genres. Can you tell us how big the CSV 
file is? Depending on its size, you may need to use on-disk storage 
(perhaps shelve, as you're already doing) but for illustration purposes 
I'll assume it all fits in memory and just use regular dicts. I'm going 
to create a table that stores the counts for each word versus the 
genre:

Genre    | the | scary | silly | exciting | ... 
-----------------------------------------------
Western  | 934 |   3   |   5   |    256   |
Thriller | 899 |  145  |   84  |    732   |
Comedy   | 523 |   1   |  672  |     47   |
...
To do this using dicts, I'm going to use a dict for genres:
genre_table = {"Western": table_of_words, ...}
and each table_of_words will look like:
{'the': 934, 'scary': 3, 'silly': 5, ...}

Let's start with a helper function and table to store the data.
# Initialise the table.
genres = {}
def add_word(genre, word, count):
    genre = genre.title()  # force "gEnRe" to "Genre"
    word = word.lower()  # force "wOrD" to "word"
    count = int(count)
    row = genres.get(genre, {})
    n = row.get(word, 0)
    row[word] = n + count
    genres[genre] = row

We can simplify this code by using the confusingly named, but useful, 
setdefault method of dicts:
def add_word(genre, word, count):
    genre = genre.title()
    word = word.lower()
    count = int(count)
    row = genres.setdefault(genre, {})
    row[word] = row.get(word, 0) + count
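For comparison (a variation, not part of the original reply), collections.defaultdict can express the same bookkeeping without setdefault or get:

```python
from collections import defaultdict

# A missing genre gets a fresh inner dict whose missing words default to 0.
genres = defaultdict(lambda: defaultdict(int))

def add_word(genre, word, count):
    genres[genre.title()][word.lower()] += int(count)
```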

Now let's process the CSV file. I'm afraid I can't remember how the CSV 
module works, and I'm too lazy to look it up, so this is pseudo-code 
rather than Python:
for row in csv file:
   genre1 = get column Genre1
   genre2 = get column Genre2
   genre3 = get column Genre3
   word = get column Word
   count = get column Count
   add_word(genre1, word, count)
   add_word(genre2, word, count)
   add_word(genre3, word, count)
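Filled in with the csv module, that pseudo-code might look like the following (the helper is repeated so the snippet stands alone, and the sample rows are made up in the column order of the original file):

```python
import csv
import io

genres = {}

def add_word(genre, word, count):
    genre = genre.title()
    word = word.lower()
    row = genres.setdefault(genre, {})
    row[word] = row.get(word, 0) + int(count)

# Stand-in for the real file: ID, Genre1, Genre2, Genre3, Star-rating, Word, Count
data = io.StringIO(
    "1,Drama,Thriller,Western,1,the,30\n"
    "2,Comedy,Drama,Western,5,the,20\n")
for doc, g1, g2, g3, rating, word, count in csv.reader(data):
    add_word(g1, word, count)
    add_word(g2, word, count)
    add_word(g3, word, count)
```

With a real file you would pass an open file object to csv.reader instead of the StringIO.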

Now we can easily query our table for useful information:
# list of unique words for the Western genre
genres["Western"].keys()
# count of unique words for the Romance genre
len(genres["Romance"])
# number of times "underdog" is used in Sports movies
genres["Sport"]["underdog"]
# total count of words for the Comedy genre
sum(genres["Comedy"].values())

Do you want to do lookups efficiently the other way as well? It's easy 
to add another table:
Word  | Western | Thriller | ... 
-----------------------------------------------
the   |   934   |   899    |
scary |    3    |   145    |
...

Add a second global table:
genres = {}
words = {}

And modify the helper function:
def add_word(genre, word, count):
    genre = genre.title()
    word = word.lower()
    count = int(count)
    # Add word to the genres table.
    row = genres.setdefault(genre, {})
    row[word] = row.get(word, 0) + count
    # And to the words table.
    row = words.setdefault(word, {})
    row[genre] = row.get(genre, 0) + count


-- 
Steven D'Aprano

------------------------------
Message: 6
Date: Sat, 2 Oct 2010 08:35:13 +0000
From: Roelof Wobben <rwobben at hotmail.com>
To: <tutor at python.org>
Subject: [Tutor] data question
Message-ID: <SNT118-W643D8156677EB89D46D414AE6A0 at phx.gbl>
Content-Type: text/plain; charset="iso-8859-1"

Hello, 

As a test I would like to write a program where a user can input game data 
(home-team, away-team, home-score, away-score) and which makes a ranking of it. 
I'm not looking for an OOP solution because I'm not comfortable with OOP.

Now my question is:

In which datatype can I put this data?

I thought myself of a dictionary of tuples.
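That idea could look something like this (the team names and the 3/1/0 scoring rule are made-up examples):

```python
# One entry per game: (home_team, away_team) -> (home_score, away_score)
games = {
    ("Ajax", "Feyenoord"): (2, 1),
    ("PSV", "Ajax"): (0, 0),
}

# Rank teams with 3 points for a win and 1 point for a draw.
points = {}
for (home, away), (home_score, away_score) in games.items():
    points.setdefault(home, 0)
    points.setdefault(away, 0)
    if home_score > away_score:
        points[home] += 3
    elif home_score < away_score:
        points[away] += 3
    else:
        points[home] += 1
        points[away] += 1

ranking = sorted(points, key=points.get, reverse=True)
```

Note one caveat with this key choice: a second game between the same two teams would overwrite the first, much like the zip-code collision raised earlier in the digest.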

Regards,

Roelof
                     
------------------------------
_______________________________________________
Tutor maillist  -  Tutor at python.org
http://mail.python.org/mailman/listinfo/tutor

End of Tutor Digest, Vol 80, Issue 10
************************************


