Parallelising code

Paul Boddie paul at boddie.org.uk
Tue Sep 16 05:16:12 EDT 2008


On 15 Sep, 18:46, "psaff... at googlemail.com" <psaff... at googlemail.com>
wrote:
> I have some file processing code that has to deal with quite a lot of
> data. I have a quad core machine, so I wondered whether I could take
> advantage of some parallelism.

Take a look at this page for some solutions:

http://wiki.python.org/moin/ParallelProcessing

In addition, Jython and IronPython provide the ability to use threads
more effectively: neither has CPython's global interpreter lock, so
threads there can run Python code on several cores at once.
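
If you stay with CPython, something like the multiprocessing module
(included from Python 2.6, and available to earlier versions as the
third-party "processing" package) lets you hand each worker process a
few files and merge the results afterwards. Here is a rough sketch;
the '*.csv' pattern and the grouping by first column are only
stand-ins for your actual filenames and lookup structures:

import csv
import glob
from multiprocessing import Pool

def process_file(filename):
    # Parse one CSV file and group its rows. Grouping by the first
    # column is just a placeholder for the real lookup against the
    # precomputed data structures.
    buckets = {}
    for row in csv.reader(open(filename, 'rb')):
        if row:
            buckets.setdefault(row[0], []).append(row)
    return buckets

if __name__ == '__main__':
    pool = Pool(processes=4)  # one worker per core
    per_file = pool.map(process_file, glob.glob('*.csv'))
    # Merge the per-file results back into the master lists.
    merged = {}
    for buckets in per_file:
        for key, rows in buckets.items():
            merged.setdefault(key, []).extend(rows)

Note that each worker returns its results instead of appending to
shared lists: the worker processes don't share memory, so all the
merging has to happen in the parent.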

> Essentially, I have a number of CSV files, let's say 100, each
> containing about 8000 data points. For each point, I need to look up
> some other data structures (generated in advance) and append the point
> to a relevant list. I wondered whether I could get each core to handle
> a few files each. I have a few questions:
>
> - Am I actually going to get any speed up from parallelism, or is it
> likely that most of my processing time is spent reading files? I guess
> I can profile for this?

There are a few things to consider, and it is useful to see where
most of the time is actually being spent. One interesting exercise,
"Wide Finder 2", run by Tim Bray (see [1] for more details),
investigated processing large log files with many concurrent
processes, but it was often argued there that the greatest speed-up
over a naive serial implementation came from optimising the input and
output and from choosing the right parsing strategy, rather than from
the concurrency itself.
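
So before parallelising anything, it is worth timing the raw input on
its own against the full job: if reading the files accounts for most
of the total, the work is I/O-bound and extra cores won't buy you
much. A crude sketch along those lines (again assuming a '*.csv'
pattern for your files):

import csv
import glob
import time

filenames = glob.glob('*.csv')

# Time raw input alone.
start = time.time()
for filename in filenames:
    open(filename, 'rb').read()
print 'Reading only: %.2fs' % (time.time() - start)

# Time reading plus parsing; the difference is the CPU-bound part
# that parallelism could actually help with.
start = time.time()
for filename in filenames:
    for row in csv.reader(open(filename, 'rb')):
        pass
print 'Reading and parsing: %.2fs' % (time.time() - start)

Bear in mind that the operating system will have cached the files
after the first loop, so run the two timings separately (or repeat
them) to get comparable numbers. The standard profile and cProfile
modules will give you a more detailed breakdown, as you suggest.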

Paul

[1] http://www.tbray.org/ongoing/When/200x/2008/05/01/Wide-Finder-2


