[Tutor] Shelve & immutable objects

Steven D'Aprano steve at pearwood.info
Thu Jan 2 15:29:58 CET 2014


On Thu, Jan 02, 2014 at 04:15:06AM -0500, Keith Winston wrote:

> Separately, I'm also curious about how to process big files. For example, I
> was trying to play 100 million games of chutes & ladders, and I crashed my
> machine, I believe: the game results, including 4 ints & 2 short lists of
> ints per game, are gathered into a list, so it can become a pretty big
> list. 

How to process big files, in order of "best" to "worst" (in my opinion):

(1) Get more memory.

(2) Find a way to process them in small chunks, e.g. a line at a time, 
rather than trying to hold the entire file in memory at once (see the 
sketch after this list).

(3) Split them into small files, then process each file in turn.

(4) Pass them to external tools which are optimized for dealing with 
huge files.

(5) Find a way to process them using mmap.

(6) Write your own tool for handling huge files.
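
As a taste of option (2), here is a rough sketch that assumes each 
game's results are written out as one line of comma-separated integers 
(the file name and layout are made up purely for illustration):

    def summarise_games(filename="games.txt"):
        total_games = 0
        longest_game = 0
        with open(filename) as f:
            for line in f:           # one line at a time, never the whole file
                # pretend the first field on each line is the move count
                moves = int(line.split(",")[0])
                total_games += 1
                longest_game = max(longest_game, moves)
        return total_games, longest_game

Because the file object is iterated one line at a time, memory use 
stays small no matter how many games are in the file.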


> I need to do stats and other analyses on it in the end (okay, I
> really don't NEED to play 100 million games of chutes & ladders, but as
> long as I have...): I suppose I could break it into manageable (maybe 1
> million games each), but that will make some of the stats either clunky or
> wrong (I haven't really attacked that part yet).

It shouldn't. Well, it might, but only if you're doing some pretty 
sophisticated statistical analysis. Most common statistics can be 
calculated on a single pass through the data. If you need to calculate 
(say) the mean, variance and standard deviation, it is moderately 
straightforward to calculate all three in a single pass without needing 
to have all the data in memory at once.

(Feel free to ask for more detail if you wish.)
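
To give you the flavour of it, here is a sketch of Welford's method 
for computing the mean, variance and standard deviation in one pass. 
The data can come from any iterable, including a generator reading 
games off disk one at a time:

    import math

    def running_stats(values):
        # Welford's single-pass algorithm for mean and variance.
        n = 0
        mean = 0.0
        m2 = 0.0                    # running sum of squared deviations
        for x in values:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        variance = m2 / (n - 1) if n > 1 else 0.0   # sample variance
        return mean, variance, math.sqrt(variance)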

Even a statistic like the median, which normally requires the entire 
data set to be in memory so it can be sorted, can be calculated with a 
couple of passes through the data. I think there is even a single-pass 
algorithm for it.
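
If your values are small integers (like the number of moves in a 
game), you can also sidestep sorting the raw data entirely by tallying 
the values in one pass and reading the median off the tallies. A rough 
sketch, which returns the lower of the two middle values when the 
count is even:

    from collections import Counter

    def median_of_counts(values):
        counts = Counter(values)        # one pass over the data
        n = sum(counts.values())
        midpoint = (n - 1) / 2.0        # 0-based index of the middle value
        seen = 0
        for value in sorted(counts):    # only the distinct values get sorted
            seen += counts[value]
            if seen > midpoint:
                return value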


> And since I'm not REALLY ready to ask this question, I'll tack it on to the
> end... I'm also beginning to think about how to speed it up: I'm imagining
> my two options are going to be to code some sections in a faster language
> (i.e. C), or maybe to introduce multi-threading since I'm working on a
> multicore machine generally (core I7), and I'm doing a lot of iterations of
> the same thing with no important order... seems like a good candidate.

A game of Chutes and Ladders (or as we call it in Australia, Snakes 
and Ladders) surely doesn't need to be optimized. It will spend most of 
its time waiting for the human user, who is probably a hundred thousand 
or a million times slower than the computer.

But let's say you want to make it faster regardless. How to make it 
faster:

(1) Use a smarter algorithm or less wasteful implementation.

(2) Get more memory.

(3) Get a faster computer.

(4) I think what you are trying to do is a good candidate for PyPy, the 
optimizing Python JIT compiler. Especially if you are playing multiple 
millions of games.

(5) Profile, profile, profile, profile.

Only once you have profiled your code can you possibly hope to guess 
where the bottlenecks are. After 15 or 20 years of programming with 
Python, my guesses for the bottlenecks are still wrong more often than 
they are right.
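
The standard library's cProfile module will tell you where the time 
actually goes. A sketch, with play_games standing in for whatever your 
main function is really called:

    import cProfile
    import pstats

    # play_games must be defined in (or imported into) the main script.
    cProfile.run("play_games(100000)", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)

or, from the shell:

    python -m cProfile -s cumulative yourscript.py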

Assuming you identify the actual bottlenecks:

(6) Try writing them in Cython, which will often give you good speedups.

(7) Or write a C extension.

Should you use threads? Probably not.

- Unless the bottleneck in your code is disk or network I/O, 
  threads will not help; they'll just make your program slower.

- Unless you use IronPython or Jython instead, in which case 
  threads *might* help. But probably not as much as you think.

- If your bottleneck is CPU processing, you can use the 
  multiprocessing module instead of threads (see the sketch 
  after this list).

- You did profile your code first, didn't you?
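
If you do reach for multiprocessing, a rough sketch looks something 
like this, with play_one_game standing in for your real game function:

    import random
    from multiprocessing import Pool

    def play_one_game(seed):
        # Stand-in for the real game: return whatever you record per game.
        rng = random.Random(seed)
        return rng.randint(20, 200)     # e.g. number of moves taken

    if __name__ == "__main__":
        with Pool() as pool:            # one worker per CPU core by default
            results = pool.map(play_one_game, range(1000000), chunksize=10000)
        print(len(results), "games played")

Each worker plays its share of the games independently, which suits 
your "lots of iterations, no important order" situation nicely.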


> Now,
> I'm probably pretty far from that piece (in my learning process), but this
> is moving along pretty well so I'm open to suggestions about how to
> proceed. I've started switching up my code a fair bit to try to make it
> more OOP, though I'm still rough on that part.


Advice about optimization from some experts:

http://en.wikipedia.org/wiki/Program_optimization#Quotes

My favourite quote is:

    The First Rule of Program Optimization: 
        Don't do it. 

    The Second Rule of Program Optimization (for experts only!): 
        Don't do it yet.



-- 
Steven

