[Numpy-discussion] Multi thread loading data

Sebastian Haase seb.haase at gmail.com
Thu Jul 2 11:30:03 EDT 2009


On Thu, Jul 2, 2009 at 5:14 PM, Chris Colbert <sccolbert at gmail.com> wrote:
> can you hold the entire file in memory as single array with room to spare?
> If so, you could use multiprocessing and load a bunch of smaller
> arrays, then join them all together.
>
> It won't be super fast, because serializing a numpy array is somewhat
> slow when using multiprocessing. That said, it's still faster than disk
> transfers.
>
> I'm sure some numpy expert will come along, though, and give you a
> much better idea.
>
>
>
> On Wed, Jul 1, 2009 at 7:57 AM, Mag Gam <magawake at gmail.com> wrote:
>> Is it possible to use loadtxt in a multi-threaded way? Basically, I want
>> to process a very large CSV file (100+ million records), and instead of
>> loading a thousand elements into a buffer, processing them, then loading
>> another thousand elements, processing those, and so on...
>>
>> I was wondering if there is a technique where I can use multiple
>> processors to do this faster.
>>
>> TIA
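
As a point of reference, the chunked reading itself works fine in a
single process; here is a minimal sketch (the file name, delimiter,
chunk size and the process() step are placeholders, not anything from
this thread):

import itertools
import numpy as np

def process(chunk):
    # stand-in for whatever per-chunk work is actually needed
    print(chunk.shape, chunk.mean())

with open("data.csv") as fh:
    while True:
        # loadtxt accepts any iterable of lines, so islice hands it the
        # next block of rows without reading the whole file into memory
        lines = list(itertools.islice(fh, 1000))
        if not lines:
            break
        process(np.loadtxt(lines, delimiter=","))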

Do you know about the GIL (global interpreter lock) in Python?
It means that Python isn't doing "real" multithreading...
Only if one thread is e.g. doing some slow or blocking I/O can the
other thread keep working, e.g. doing CPU-heavy numpy stuff.
So you would not get 2-CPU numpy code, except for some C-implemented
"long running" operations; these should be programmed in a way that
releases the GIL, so that the other CPU can go on running its Python
code.
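
A rough sketch of the multiprocessing route Chris describes, where the
GIL is no obstacle because each chunk is parsed in its own process (the
file name, worker count and up-front row count are assumptions for
illustration):

import itertools
import multiprocessing
import numpy as np

FNAME = "data.csv"
NROWS = 100_000_000   # total number of records, assumed known up front
NPROC = 4

def load_slice(args):
    start, count = args
    with open(FNAME) as fh:
        # loadtxt accepts a generator of lines, so islice bounds the rows
        # this worker parses (it still has to skip past the earlier lines)
        return np.loadtxt(itertools.islice(fh, start, start + count),
                          delimiter=",")

if __name__ == "__main__":
    step = NROWS // NPROC
    jobs = [(i * step, step if i < NPROC - 1 else NROWS - i * step)
            for i in range(NPROC)]
    with multiprocessing.Pool(NPROC) as pool:
        # every partial array is pickled back to the parent here; that is
        # the serialization cost mentioned above
        parts = pool.map(load_slice, jobs)
    data = np.concatenate(parts)
    print(data.shape)

Fewer, larger chunks amortize that pickling cost better, at the price of
more memory held in the worker processes at once.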

HTH,
Sebastian Haase


