Processing large CSV files - how to maximise throughput?

Dave Angel davea at davea.name
Fri Oct 25 07:24:45 EDT 2013


On 25/10/2013 02:13, Chris Angelico wrote:

> On Fri, Oct 25, 2013 at 2:57 PM, Dave Angel <davea at davea.name> wrote:
>> But I would concur -- probably they'll both give about the same speedup.
>> I just detest the pain that multithreading can bring, and tend to avoid
>> it if at all possible.
>
> I don't have a history of major pain from threading. Is this a Python
> thing,

No, all my pain has been in C++.  But nearly all my Python code has
been written solo, so I can adopt strict rules.

The problem is that with many people's sticky fingers in the code
pie, things that seem innocuous break once they can happen from
multiple threads.  And C++ gives you no way to tell at a glance
where you're asking for trouble.
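
For instance, a hypothetical sketch (not code from any particular
project): a counter whose increment looks perfectly safe at the call
site, but is a data race the moment two threads use it.

    #include <iostream>
    #include <thread>
    #include <vector>

    // Looks harmless: nothing in the signature or at the call site
    // hints that this is unsafe under threads.
    struct Counter {
        long value = 0;
        void bump() { ++value; }  // read-modify-write: a data race if
                                  // called from two threads unsynchronized
    };

    int main() {
        Counter c;
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back([&c] {
                for (int j = 0; j < 100000; ++j) c.bump();
            });
        for (auto& t : workers) t.join();
        // Expected 400000; under contention the total usually comes up
        // short, and formally this is undefined behavior.
        std::cout << c.value << '\n';
    }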

I've also been involved in projects that existed and were broken long
before I came on the scene.  Finding that something is broken when the
breakage wasn't caused by any recent change is painful.  And in large,
old C++ projects, memory management is convoluted.  Even when nothing
is broken, performance sometimes takes big hits because multiple
threads are banging at the same cache line, even with thread affinity
making sure each thread runs on a consistent core.  So you end up with
a dozen memory allocation strategies, each with different rules and
restrictions.
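
To illustrate the cache-line problem, here's a hypothetical C++
sketch (assuming 64-byte cache lines, typical of x86) where two
threads hammer adjacent counters.  The logic is identical in both
structs; only the padding differs, and the padded version avoids the
false sharing.

    #include <atomic>
    #include <thread>

    // Two per-thread counters packed next to each other land on the
    // same cache line, so each increment invalidates the other core's
    // copy ("false sharing"), even though no data is actually shared.
    struct Packed {
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };

    // Aligning each counter to its own cache line removes the contention.
    struct Padded {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <typename T>
    void hammer(T& counters) {
        std::thread t1([&] { for (int i = 0; i < 10000000; ++i) counters.a++; });
        std::thread t2([&] { for (int i = 0; i < 10000000; ++i) counters.b++; });
        t1.join();
        t2.join();
    }

    int main() {
        Packed p;   // measurably slower on most hardware
        Padded q;   // same work, independent cache lines
        hammer(p);
        hammer(q);
    }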

> or have I just been really really fortunate (growing up on OS/2
> rather than Windows has definitely been, for me, a major boon)?
> Generally, I find threads to be convenient, though of course not
> always useful (especially in serialized languages).
>


All my heavy duty multi-task/multi-thread stuff has been on Linux.

And Python floats above most of the issues I'm trying to describe.
For example, you don't have to tune the memory management because it's
out of your control.  Still, my bias persists.

-- 
DaveA




