Processing large CSV files - how to maximise throughput?

Chris Angelico rosuav at gmail.com
Fri Oct 25 03:26:34 EDT 2013


On Fri, Oct 25, 2013 at 5:39 PM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> Basically, with multiple processes, you start with independent systems and
> add connections specifically where needed, whereas with threads, you start
> with completely shared state and then prune away interdependencies and
> concurrency until it seems to work safely. That approach makes it
> essentially impossible to prove that threading is safe in a given setup,
> except for the really trivial cases.

Not strictly true. With multiple threads, you start with completely
shared global state and completely independent local state (in
assembly-language terms, a shared data segment and a separate stack
per thread). If you treat your globals as either read-only or
carefully controlled, then it makes little difference whether you're
forking processes or spinning off threads, except that with threads
you don't need special (IPC-based) data structures for the global
state.
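
A minimal sketch of that discipline, in case it helps. The names
(WEIGHTS, score, run) and the data are made up for illustration and
have nothing to do with the original poster's CSV job; the only point
is that the global is never written to and all working state is local.

```python
import threading

# Hypothetical read-only global: every thread may read it, none modifies it.
WEIGHTS = {"a": 2, "b": 3}

def score(rows, out, slot):
    # All working state (total, key, count) lives in locals, so each
    # thread has its own private copy on its own stack.
    total = 0
    for key, count in rows:
        total += WEIGHTS[key] * count
    # The one carefully controlled shared write: each thread touches
    # only its own slot in the results list.
    out[slot] = total

def run(chunks):
    results = [None] * len(chunks)
    threads = [threading.Thread(target=score, args=(chunk, results, i))
               for i, chunk in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker before reading the results
    return results
```

Calling run() with a list of chunks returns one total per chunk, in
chunk order, regardless of which thread finishes first.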

For me, threading largely grew out of the same sorts of concerns as
recursion - as long as all your internal state is in locals, nothing
can hurt you. Of course, it's still far easier to shoot yourself in
the foot with threads than with processes, but for the tasks I've used
them for, I've never found any footholes. That may, however, be down
to the simplicity of the two main jobs I've used threads for: socket
handling (where nearly everything's I/O bound) and worker threads spun
off to let the GUI remain responsive (posting a message back to the
main thread when there's a result).
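
The worker-thread pattern can be sketched roughly as below. This is
an assumption-laden illustration, not any particular GUI toolkit's
API: a queue.Queue stands in for the GUI's message queue, and
slow_job, worker and main_loop are hypothetical names.

```python
import queue
import threading

def slow_job(n):
    # Stand-in for the real long-running work (hypothetical).
    return sum(i * i for i in range(n))

def worker(n, mailbox):
    # Runs off the main thread. Its only contact with the main thread
    # is the single message it posts when the result is ready.
    mailbox.put(("done", n, slow_job(n)))

def main_loop(jobs):
    # Assumes the job arguments are distinct, since they key the results.
    mailbox = queue.Queue()
    for n in jobs:
        threading.Thread(target=worker, args=(n, mailbox), daemon=True).start()
    results = {}
    while len(results) < len(jobs):
        # A real GUI would keep handling its own events here and poll
        # with get_nowait(); this sketch simply blocks until a message
        # arrives.
        kind, n, value = mailbox.get()
        results[n] = value
    return results
```

Only the main thread ever touches the results dict, so nothing needs
a lock beyond the one queue.Queue already manages internally.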

ChrisA


