Processing large CSV files - how to maximise throughput?

Chris Angelico rosuav at gmail.com
Fri Oct 25 07:42:00 EDT 2013


On Fri, Oct 25, 2013 at 10:24 PM, Dave Angel <davea at davea.name> wrote:
> On 25/10/2013 02:13, Chris Angelico wrote:
>
>> On Fri, Oct 25, 2013 at 2:57 PM, Dave Angel <davea at davea.name> wrote:
>>> But I would concur -- probably they'll both give about the same speedup.
>>> I just detest the pain that multithreading can bring, and tend to avoid
>>> it if at all possible.
>>
>> I don't have a history of major pain from threading. Is this a Python
>> thing,
>
> No, all my pain has been in C++.  But nearly all my Python code has
> been written solo, so I can adopt strict rules.
>
> The problem comes that with many people's sticky fingers in the code
> pie, things that seem innocuous are broken once they can happen from
> multiple threads.  And C++ does not give you any way to tell at a glance
> where you're asking for trouble.

Yeah, that is a big issue. And even the simple rule I mentioned
earlier (keep everything on the stack) isn't sufficient when you use
library functions that aren't thread-safe... even slabs of the
standard library aren't, and I'm not just talking about obvious ones
like strtok().
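
To illustrate the kind of seemingly innocuous operation that breaks
once multiple threads can reach it - a minimal Python sketch (names
are illustrative, not from anyone's real code): the read-modify-write
hiding inside `+= 1` is not atomic, even with the GIL, so an unguarded
shared counter can silently lose updates.

```python
import threading

def bump(counter, lock, n):
    # The read-modify-write in `counter["value"] += 1` is not atomic,
    # even under the GIL; without the lock, concurrent bumps can be lost.
    for _ in range(n):
        with lock:
            counter["value"] += 1

counter = {"value": 0}
lock = threading.Lock()
threads = [threading.Thread(target=bump, args=(counter, lock, 100_000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter["value"])  # 400000: no updates lost while the lock is held
```

Drop the `with lock:` and the total routinely comes up short - and, in
the spirit of the discussion above, nothing at the call site gives you
any hint that it might.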

> I've also been involved in projects that existed and were broken long
> before I came on the scene.  Finding that something is broken when it
> wasn't caused by any recent change is painful.  And in large, old C++
> projects, memory management is convoluted.  Even when not broken,
> sometimes performance takes big hits because multiple threads are
> banging at the same cache line. (using thread affinity to make sure
> each thread uses a consistent core)  So you have a dozen memory
> allocation strategies, each with different rules and restrictions.

Absolutely. Something might have just happened to work, even for
years. Just today I came across something of almost brown-paper-bag
quality in my own production code at work; the TCP socket
protocol between two processes wasn't properly looking for the newline
that marks the end of the message, and would have bugged out badly if
ever two messages had been combined into one socket-read call. With
everything working perfectly, like on a low-latency LAN where all our
testing happens, it was all safe, but if anything lagged out, it would
have meant messages got lost. Yet it's been working for years without
a noticed glitch.
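
The fix for that class of bug is to buffer reads and split on the
delimiter rather than assuming one `recv()` equals one message.
A minimal sketch (class and method names are hypothetical, not the
actual production code):

```python
class LineFramer:
    """Accumulate raw socket reads and emit only complete
    newline-terminated messages, however the reads are chunked."""

    def __init__(self):
        self.buffer = b""

    def feed(self, chunk):
        # Append one recv() worth of bytes, then peel off every
        # complete message; a trailing partial message stays buffered.
        self.buffer += chunk
        messages = []
        while b"\n" in self.buffer:
            line, self.buffer = self.buffer.split(b"\n", 1)
            messages.append(line)
        return messages

framer = LineFramer()
# Two messages combined into one read, plus a partial third:
print(framer.feed(b"first\nsecond\nthi"))  # [b'first', b'second']
print(framer.feed(b"rd\n"))                # [b'third']
```

With this shape, a lagged connection that batches several messages
into one read - or splits one across two - is handled identically to
the happy low-latency case.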

>> or have I just been really really fortunate (growing up on OS/2
>> rather than Windows has definitely been, for me, a major boon)?
>> Generally, I find threads to be convenient, though of course not
>> always useful (especially in serialized languages).
>
> All my heavy duty multi-task/multi-thread stuff has been on Linux.

OS/2 is very Unix-like in many areas, but it doesn't have a convenient
fork(), so processes are rather more fiddly to spawn than threads are.
Actually, sliding from OS/2 to Linux has largely led to me sliding
from threads to processes - forking a process under Linux is as easy
as spinning off a thread under OS/2, and as easy to pass state to
(though (obviously) harder to pass state back from).
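
A quick sketch of that asymmetry under Linux (Unix-only, and the
names here are just for illustration): the child inherits the
parent's state for free, but getting a result back requires an
explicit channel such as a pipe.

```python
import os

# State set up before the fork is inherited by the child for free.
config = {"greeting": "hello from parent state"}

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: sees `config` directly, no hand-off needed; but a result
    # has to travel back through an explicit channel (here, the pipe).
    os.close(r)
    os.write(w, config["greeting"].upper().encode())
    os._exit(0)
else:
    # Parent: read the child's answer, then reap it.
    os.close(w)
    result = os.read(r, 1024).decode()
    os.waitpid(pid, 0)
    print(result)  # HELLO FROM PARENT STATE
```

Passing state in is as easy as referring to a variable; passing it
back out is where pipes, sockets, or `multiprocessing` plumbing come in.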

> And Python floats above most of the issues I'm trying to describe.
> E.g. you don't have to tune the memory management because it's out of
> your control. Still my bias persists.

Hrmmmmm.... yes, I can imagine running into difficulties with a
non-thread-safe malloc/free implementation... I think that would rate
about a 4 on the XKCD 883 scale.

ChrisA
