Parallelising code

Dan Upton upton at virginia.edu
Mon Sep 15 13:07:41 EDT 2008


> Essentially, I have a number of CSV files, let's say 100, each
> containing about 8000 data points. For each point, I need to look up
> some other data structures (generated in advance) and append the point
> to a relevant list. I wondered whether I could get each core to
> handle a few files. I have a few questions:
>
> - Am I actually going to get any speed up from parallelism, or is it
> likely that most of my processing time is spent reading files? I guess
> I can profile for this?

That probably depends both on how much data is involved in a "data
point" (i.e., is it just one value, or are you parsing several fields
from the CSV per record) and on how much processing each point
involves.  Profiling should enlighten you, yes.  One caveat worth
knowing up front: in CPython the global interpreter lock (GIL) keeps
threads from executing Python bytecode in parallel, so to actually use
multiple cores for CPU-bound work you want separate processes (the new
multiprocessing module in 2.6, or the third-party "processing" package
it grew out of).  You may also have issues with I/O contention if you
have lots of threads or processes trying to read from the disk at
once, although I'm not sure how much of an impact that will have.
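
Since you asked about profiling: a quick way to see whether reading or
processing dominates is to time the two phases separately for a single
file.  A minimal sketch (the file name and process_point() are
placeholders for your own code):

    import csv
    import time

    def process_point(row):
        # Placeholder for your real lookups and list appends.
        pass

    def time_one_file(path):
        t0 = time.time()
        with open(path, 'rb') as f:   # 'rb' for the csv module on 2.x
            rows = list(csv.reader(f))
        t1 = time.time()
        for row in rows:
            process_point(row)
        t2 = time.time()
        print('%s: read %.3fs, process %.3fs' % (path, t1 - t0, t2 - t1))

    time_one_file('data000.csv')

If the read time dominates, extra cores won't buy you much; for a
finer-grained breakdown you can run the whole script under
python -m cProfile instead.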

>
> - Is list.append() thread-safe? (Not sure if this is the right
> term.) What I mean is, can two separate processors file a point in
> the same list at the same time without anything horrible happening?
> Do I need to do anything special (a mutex or whatever) to make this
> happen, or will it happen automatically?

These two links cover exactly that question:

http://mail.python.org/pipermail/python-list/2007-March/430017.html

http://effbot.org/pyfaq/what-kinds-of-global-value-mutation-are-thread-safe.htm
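
Short version of the effbot FAQ above: in CPython a single
list.append() is atomic because of the GIL, so two threads can append
to the same list without corrupting it and without an explicit mutex
(you just can't predict the ordering).  But note that if you go with
separate processes to get real use out of multiple cores, the workers
don't share your lists at all; the usual pattern is for each worker to
build its own partial result and for the parent to merge them.  A
rough sketch along those lines, where bucket_for() is a hypothetical
stand-in for your lookup-tables logic and '*.csv' for your real file
names:

    import csv
    import glob
    from multiprocessing import Pool   # 2.6+; the third-party
                                       # "processing" package is similar

    def bucket_for(row):
        # Placeholder: decide which list this point belongs in.
        return row[0]

    def process_file(path):
        # Read one CSV file and bucket its points locally -- no
        # shared state is touched inside the workers.
        buckets = {}
        with open(path, 'rb') as f:
            for row in csv.reader(f):
                buckets.setdefault(bucket_for(row), []).append(row)
        return buckets

    if __name__ == '__main__':
        pool = Pool()   # defaults to one worker per core
        merged = {}
        for partial in pool.map(process_file, glob.glob('*.csv')):
            for key, points in partial.items():
                merged.setdefault(key, []).extend(points)
        pool.close()
        pool.join()

Because each worker handles whole files and only returns its buckets,
nothing ever mutates a shared list, and the merge at the end is plain
serial code.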


