[Numpy-discussion] Loading a > GB file into array

David Cournapeau david at ar.media.kyoto-u.ac.jp
Fri Dec 21 07:23:49 EST 2007


Sebastian Haase wrote:
> On Dec 21, 2007 12:11 AM, Martin Spacek <numpy at mspacek.mm.st> wrote:
>   
>>>> By the way, I installed 64-bit Linux (Ubuntu 7.10) on the same machine,
>>>> and now numpy.memmap works like a charm. Slicing around a 15 GB file is fun!
>>>>
>>> Thanks for the feedback !
>>> Did you get the kind of speed you need and/or the speed you were hoping for ?
>>>
>> Nope. Like I wrote earlier, it seems there isn't time for disk access
>> in my main loop, and disk access is what memmap is all about. I
>> resolved this by loading the whole file into memory as a Python list
>> of 2D arrays instead of one huge contiguous 3D array. That got me an
>> extra 100 to 200 MB of physical memory to work with (about 1.4 GB out
>> of 2 GB total) on win32, which is all I needed.
>>
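
For reference, here is a minimal sketch of the list-of-2D-arrays load
described above. The file name and dimensions are placeholders, and a
raw, headerless uint8 movie file is assumed:

    import numpy as np

    # Placeholder parameters for a raw, headerless uint8 movie file.
    fname = 'movie.raw'
    nframes, height, width = 9000, 480, 640
    framesize = height * width   # bytes per frame (one uint8 per pixel)

    # Load each frame as its own 2D array instead of one contiguous 3D
    # block: many small allocations can succeed on a fragmented 32-bit
    # address space where a single huge allocation fails.
    frames = []
    with open(fname, 'rb') as f:
        for _ in range(nframes):
            frame = np.fromfile(f, dtype=np.uint8, count=framesize)
            frames.append(frame.reshape(height, width))
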
>
> Instead of saying "memmap is ALL about disk access" I would rather
> say that "memmap is all about SMART disk access" -- what I mean is
> that memmap should run as fast as a normal ndarray when it works on
> the cached part of an array.  Maybe there is a way of telling memmap
> when and what to cache, and when to sync that cache to the disk.
> In other words, memmap should perform just like an in-physical-memory
> array -- only that it once in a while saves/loads to/from the disk.
> Or is this just wishful thinking?
> Is there a way of "pre-loading" a given part into the cache (physical
> memory), or of preventing disk writes at "bad times"?
> How about doing the sync from a different thread ;-)
>
mmap uses the OS I/O caches; that's kind of the point of using mmap
(at least in this case). Instead of doing the caching yourself, the OS
does it for you, and OSes are supposed to be smart about this :)

cheers,

David
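
To make the point about OS caching concrete, here is a small sketch of
leaning on the page cache through numpy.memmap. The file name and shape
are placeholders; "pre-loading" a region simply amounts to touching its
pages so the OS faults them into the cache:

    import numpy as np

    # Open the movie read-only as a memory map; no data is read yet.
    data = np.memmap('movie.raw', dtype=np.uint8, mode='r',
                     shape=(9000, 480, 640))

    # Touching a slice faults its pages into the OS cache, so a
    # throwaway reduction over the frames needed soon "pre-loads" them.
    data[1000:2000].sum()

    # Later reads of those frames are served from RAM until the OS
    # decides to evict the pages.
    frame = data[1500]

Whether the pages stay cached, and when dirty pages get written back,
is up to the kernel; memmap.flush() forces write-back of a writable
map, but there is no portable way from numpy to pin pages in memory.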


