[Numpy-discussion] Multi thread loading data

Sebastian Haase seb.haase at gmail.com
Thu Jul 2 12:04:41 EDT 2009


On Thu, Jul 2, 2009 at 5:38 PM, Chris Colbert<sccolbert at gmail.com> wrote:
> Who are you quoting, Sebastian?
>
> Multiprocessing is a python package that spawns multiple python
> processes, effectively side-stepping the GIL, and provides easy
> mechanisms for IPC. Hence the need for serialization....
>
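For reference, the serialization Chris mentions happens whenever an array is
sent between processes, e.g. through a multiprocessing Queue -- roughly like
this (a tiny sketch; the array size and names are just made up for illustration):

    import numpy as np
    from multiprocessing import Process, Queue

    def worker(q):
        a = np.arange(1000000)   # build an array in the child process
        q.put(a)                 # the array is pickled and piped to the parent

    if __name__ == '__main__':
        q = Queue()
        p = Process(target=worker, args=(q,))
        p.start()
        result = q.get()         # unpickled into a fresh copy over here
        p.join()
        print(result.shape)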
I was replying to the OP's email

Regarding your comment: can't separate processes access the same
memory space via shared memory?
I think there was a discussion about this not too long ago on this list.
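If I remember right, the trick that came up was wrapping a multiprocessing.Array
with numpy.frombuffer, so both processes really do see the same memory -- roughly
like this (a rough sketch from memory, not the code from that thread):

    import numpy as np
    from multiprocessing import Process, Array

    def fill(shared, n):
        a = np.frombuffer(shared.get_obj())   # a view, not a copy: same memory
        a[:n] = np.arange(n)

    if __name__ == '__main__':
        n = 10
        shared = Array('d', n)                # doubles living in shared memory
        p = Process(target=fill, args=(shared, n))
        p.start()
        p.join()
        print(np.frombuffer(shared.get_obj()))  # shows the values the child wrote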

-S.



> On Thu, Jul 2, 2009 at 11:30 AM, Sebastian Haase<seb.haase at gmail.com> wrote:
>> On Thu, Jul 2, 2009 at 5:14 PM, Chris Colbert<sccolbert at gmail.com> wrote:
>>> Can you hold the entire file in memory as a single array, with room to spare?
>>> If so, you could use multiprocessing and load a bunch of smaller
>>> arrays, then join them all together.
>>>
>>> It won't be super fast, because serializing a numpy array is somewhat
>>> slow when using multiprocessing. That said, it's still faster than disk
>>> transfers.
>>>
>>> I'm sure some numpy expert will come on here though and give you a
>>> much better idea.
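Something along the lines Chris sketches above could look like this (a rough,
untested sketch -- the file name, block count and comma delimiter are just
assumptions for the example):

    import numpy as np
    from io import StringIO
    from multiprocessing import Pool

    def load_block(lines):
        # each worker parses its own block of text lines independently
        return np.loadtxt(StringIO(''.join(lines)), delimiter=',', ndmin=2)

    def parallel_loadtxt(fname, nblocks=4):
        with open(fname) as f:
            lines = f.readlines()             # whole file in memory, as above
        size = len(lines) // nblocks + 1
        blocks = [lines[i:i + size] for i in range(0, len(lines), size)]
        pool = Pool(nblocks)
        parts = pool.map(load_block, blocks)  # blocks and results get pickled
        pool.close()
        pool.join()
        return np.concatenate(parts)

    if __name__ == '__main__':
        data = parallel_loadtxt('data.csv')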
>>>
>>>
>>>
>>> On Wed, Jul 1, 2009 at 7:57 AM, Mag Gam<magawake at gmail.com> wrote:
>>>> Is it possible to use loadtxt in a multi-threaded way? Basically, I want
>>>> to process a very large CSV file (100+ million records); rather than
>>>> loading it all at once, I'd load a thousand elements into a buffer,
>>>> process them, then load the next thousand, process those, and so on...
>>>>
>>>> I was wondering if there is a technique where I can use multiple
>>>> processors to do this faster.
>>>>
>>>> TIA
>>
>> Do you know about the GIL (global interpreter lock) in Python?
>> It means that Python isn't doing "real" multithreading...
>> Only if one thread is e.g. doing some slow or blocking I/O can the
>> other thread keep working, e.g. doing CPU-heavy numpy stuff.
>> But you would not get 2-CPU numpy code - except for some C-implemented
>> "long running" operations -- these should be programmed in a way that
>> releases the GIL, so that the other CPU can go on running Python code
>> in the meantime.
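Just to illustrate the kind of overlap I mean: one thread does the blocking
reads while the main thread parses with numpy (a rough sketch -- the file name
and block size are made up, and the parsing itself still holds the GIL, so
don't expect anywhere near a 2x speedup):

    import threading
    import numpy as np
    from io import StringIO
    from queue import Queue

    def reader(fname, q, blockbytes=1000000):
        # blocking file reads release the GIL, so this thread can overlap
        # with the numpy parsing going on in the main thread
        with open(fname) as f:
            while True:
                lines = f.readlines(blockbytes)  # roughly blockbytes worth of lines
                if not lines:
                    break
                q.put(lines)
        q.put(None)                              # sentinel: end of file

    q = Queue(maxsize=4)
    t = threading.Thread(target=reader, args=('data.csv', q))
    t.start()
    parts = []
    while True:
        lines = q.get()
        if lines is None:
            break
        parts.append(np.loadtxt(StringIO(''.join(lines)), delimiter=',', ndmin=2))
    t.join()
    data = np.concatenate(parts)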
>>
>> HTH,
>> Sebastian Haase


