[Numpy-discussion] Loading a > GB file into array

David Cournapeau david at ar.media.kyoto-u.ac.jp
Fri Dec 21 08:45:37 EST 2007


Hans Meine wrote:
> On Friday, 21 December 2007 13:23:49, David Cournapeau wrote:
>   
>>> Instead of saying "memmap is ALL about disc access" I would rather
>>> like to say that "memmap is all about SMART disk access" -- what I mean
>>> is that memmap should run as fast as a normal ndarray if it works on
>>> the cached part of an array.  Maybe there is a way of telling memmap
>>> when and what to cache  and when to sync that cache to the disk.
>>> In other words, memmap should perform just like an in-physical-memory
>>> array  -- only that it once in a while saves/loads to/from the disk.
>>> Or is this just wishful thinking ?
>>> Is there a way of "pre loading" a given part into cache
>>> (physical memory) or prevent disk writes at "bad times" ?
>>> How about doing the sync from a different thread ;-)
>>>       
>> mmap is using the OS IO caches, that's kind of the point of using mmap
>> (at least in this case). Instead of doing the caching yourself, the OS
>> does it for you, and OSes are supposed to be smart about this :)
>>     
>
> AFAICS this is what Sebastian wanted to say, but as the OP indicated, 
> preloading e.g. by reading the whole array once did not work for him.
> Thus, I understand Sebastian's questions as "is it possible to help the OS 
> when it is not smart enough?".  Maybe something along the lines of mlock, 
> only not quite as aggressive.
>   
I don't know exactly why it did not work, but it is not difficult to 
imagine how it could fail: when you read a 2 GB file, it may not be 
smart, on average, to keep the whole file in the buffer cache, since 
everything else gets kicked out. It all depends on the situation, and 
many different things can influence this behaviour: the IO scheduler, 
how smart the VM subsystem is, the FS (on Linux, some filesystems are 
better than others for real-time audio DSP, and some options are better 
left out), etc. On Linux, using the deadline IO scheduler can help, for 
example (that's the recommended scheduler for IO-intensive musical 
applications).
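
To illustrate the kind of hint an application can give the OS (something 
milder than mlock), here is a minimal sketch. It assumes Python 3.8 or 
later on a system that exposes madvise; the function name 
map_with_readahead_hint is made up for the example, and the hint is 
purely advisory, so the kernel is free to ignore it:

```python
import mmap


def map_with_readahead_hint(path):
    """Map a non-empty file read-only and hint that it will be needed soon.

    Hypothetical helper for illustration only. The hint does not pin
    pages in memory (unlike mlock); the kernel may still evict them.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after
        # the file object is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # madvise and MADV_WILLNEED are not available on every platform,
    # so fall back silently to a plain mapping when they are missing.
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED)
    return mm
```

Whether this helps at all depends on the same factors as above (memory 
pressure, IO scheduler, filesystem), so it would need measuring on the 
actual workload.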

But if what you want is to be able to reliably read, "in real time", a 
big file which cannot fit in memory, then you need a design where 
something does the disk buffering the way you want. Again, taking the 
example I am somewhat familiar with: in audio processing, you often have 
an IO thread which does the pre-caching and passes the data in mlock'ed 
buffers to another thread, the one which is real-time.
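
The general shape of such a design can be sketched in Python. This is 
only an illustration of the producer/consumer structure: the name 
prefetch_chunks and its parameters are invented for the example, the 
buffers are ordinary bytes objects rather than mlock'ed memory, and 
nothing in Python gives real-time guarantees:

```python
import queue
import threading


def prefetch_chunks(path, chunk_size=1 << 20, depth=4):
    """Yield successive chunks of a file, read ahead by an IO thread.

    Hypothetical helper: at most `depth` chunks are buffered, so memory
    use stays bounded even for files much larger than RAM.
    """
    buffers = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def reader():
        try:
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    # Blocks while the queue is full, i.e. while the
                    # consumer is behind the reader.
                    buffers.put(chunk)
        finally:
            # Always signal the end, even if reading failed, so the
            # consumer does not wait forever.
            buffers.put(done)

    # Daemon thread: it will not keep the process alive if the
    # consumer stops iterating early.
    threading.Thread(target=reader, daemon=True).start()

    while True:
        chunk = buffers.get()
        if chunk is done:
            return
        yield chunk
```

The consumer simply iterates, e.g. `for chunk in prefetch_chunks(path):`, 
and only blocks when the disk cannot keep up. Note that this sketch 
swallows read errors in the IO thread; a real implementation would need 
to pass them on to the consumer.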

cheers,

David


