[Tutor] Getting total counts (Steven D'Aprano)

Steven D'Aprano steve at pearwood.info
Sun Oct 3 02:19:45 CEST 2010


On Sun, 3 Oct 2010 08:29:29 am aeneas24 at priest.com wrote:
> Thanks very much for the extensive comments, Steve. I can get the
> code you wrote to work on my toy data, but my real input data is
> actually contained in 10 files that are about 1.5 GB each--when I try
> to run the code on one of those files, everything freezes.

Do you have 15 GB of memory? Actually, you would need more than that, 
to allow for the overhead of Python objects, the operating system, and 
so forth. If not, then I'm not surprised that things freeze.
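
If all that is needed is a tally, one way to avoid the freeze is to read 
each file a row at a time instead of loading it whole. The following is 
only a sketch under assumptions: it assumes Python 3, the function name 
count_values and the choice of column are hypothetical, and it only helps 
if the number of distinct values (not the number of rows) fits in memory.

```python
import csv
from collections import Counter

def count_values(path, column=0):
    """Tally the values found in one column, reading one row at a time.

    `column` is a hypothetical choice; use whichever field holds the
    thing you want to count.
    """
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.reader(f):
            # Skip blank or short rows rather than raising IndexError.
            if len(row) > column:
                counts[row[column]] += 1
    return counts
```

Only the current row and the running tally are held in memory, so the 
size of the file matters much less than the number of distinct keys.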

> To solve this, I tried just having the data write to a different csv
> file:

A 15 GB CSV file is not the right approach.

> This doesn't work--I think there are problems in how the iterations
> happen. But my guess is that converting from one CSV to another isn't
> going to be as efficient as creating a shelve database.

shelve is not an industrial-strength database. It's a serialised 
dictionary. The source code for shelve is ~100 lines of code, plus 
comments and docstrings. Now Python is good, but it's not *that* good 
that it can create a full-strength database in 100 lines of code.
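
To make "serialised dictionary" concrete, here is a minimal sketch of 
what using shelve looks like. The helper name bump and the idea of 
storing a counter per key are assumptions for illustration, and the 
with-statement form assumes Python 3.4 or later.

```python
import shelve

def bump(db_path, key):
    """Increment a persistent counter stored under `key` and return it.

    The shelf behaves like a dict whose values are pickled to disk.
    There are no queries, indexes, or transactions.
    """
    with shelve.open(db_path) as db:
        db[key] = db.get(key, 0) + 1
        return db[key]
```

Every lookup and update goes through ordinary dictionary access by key, 
which is all that shelve offers.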

I expect that at the very least you will need to use SQLite. With 15 GB 
of data, you may even find that SQLite isn't powerful enough and you 
may need a full-blown PostgreSQL or MySQL database.
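
As a rough sketch of the SQLite route, using the sqlite3 module from the 
standard library: the table name records, its two columns, and both 
function names are hypothetical, and the code assumes Python 3 and that 
each CSV row has at least two fields. Adapt the schema to the real data.

```python
import csv
import sqlite3

def load_csv(conn, csv_path):
    """Stream rows from a CSV file into a (hypothetical) records table."""
    conn.execute("CREATE TABLE IF NOT EXISTS records (key TEXT, value TEXT)")
    with open(csv_path, newline="") as f:
        # A generator keeps only one row in memory at a time.
        rows = ((row[0], row[1]) for row in csv.reader(f) if len(row) >= 2)
        conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
    conn.commit()

def total_counts(conn):
    """Return (key, count) pairs, letting the database do the grouping."""
    cur = conn.execute(
        "SELECT key, COUNT(*) FROM records GROUP BY key ORDER BY key"
    )
    return cur.fetchall()
```

Calling load_csv once per input file with a connection from 
sqlite3.connect("some_file.db") keeps the data on disk, and the counting 
is done by the GROUP BY query rather than by a Python dictionary.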


P.S. Please don't quote the entire digest when replying:

> -----Original Message-----
> From: tutor-request at python.org
> To: tutor at python.org
> Sent: Sat, Oct 2, 2010 1:36 am
> Subject: Tutor Digest, Vol 80, Issue 10
[snip approx 150 irrelevant lines]


-- 
Steven D'Aprano

