What strategy for random accession of records in massive FASTA file?

Sat Jan 15 19:22:50 EST 2005

On Sat, 15 Jan 2005 15:24:56 -0500, Steve Holden <steve at holdenweb.com> wrote:

>Bulba! wrote:
>
>> On 14 Jan 2005 12:30:57 -0800, Paul Rubin
>> <http://phr.cx@NOSPAM.invalid> wrote:
>> 
>> 
>>>Mmap lets you treat a disk file as an array, so you can randomly
>>>access the bytes in the file without having to do seek operations
>> 
>> 
>> Cool!
>> 
>> 
>>>Just say a[234]='x' and you've changed byte 234 of the file to the
>>>letter x.  
>> 
>> 
>> However.. however.. suppose this element, located more or less
>> in the middle of an array, occupies more space after changing it,
>> say 2 bytes instead of 1. Will flush() need to rewrite half of the
>> mmaped file just to add that one byte?
>>
I wonder what mm.find('pattern'), matching somewhere in the middle of a
huge file, would do to the working set, compared with sequential reads as
in my little toy (which, BTW, is also happy to let the replacement string be
longer or shorter than the old one as it streams buffers from file to file).
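
For illustration only, here is a minimal sketch of that kind of streaming
replacement, written for Python 3. It is not the toy mentioned above; the
names stream_replace, old, new and chunk_size are made up for this example.
It reads fixed-size buffers, writes everything that cannot be part of a match,
and holds back the last len(old) - 1 bytes so a match straddling two buffers
is still found. The output file simply grows or shrinks as needed.

```python
import os
import tempfile


def stream_replace(src_path, dst_path, old, new, chunk_size=1 << 16):
    """Copy src to dst, replacing bytes `old` with `new`; return match count.

    Hypothetical helper: memory use is bounded by chunk_size + len(old) - 1.
    """
    if not old:
        raise ValueError("old must be a non-empty bytes object")
    keep = len(old) - 1          # bytes that might start a straddling match
    count = 0
    pending = b""                # unwritten tail carried into the next buffer
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                dst.write(pending)   # nothing left that could complete a match
                return count
            buf = pending + chunk
            pos = 0
            while True:
                idx = buf.find(old, pos)
                if idx == -1:
                    break
                dst.write(buf[pos:idx])
                dst.write(new)       # may be longer or shorter than old
                pos = idx + len(old)
                count += 1
            # Write all but the last `keep` unmatched bytes.
            cut = max(pos, len(buf) - keep)
            dst.write(buf[pos:cut])
            pending = buf[cut:]


# Small demonstration with a deliberately tiny buffer.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.txt")
    dst_path = os.path.join(tmp, "out.txt")
    with open(src_path, "wb") as f:
        f.write(b"xxABxxABABxx")
    n_replaced = stream_replace(src_path, dst_path, b"AB", b"12345", chunk_size=3)
    with open(dst_path, "rb") as f:
        replaced = f.read()
```

The price of allowing a length change is that everything after the first
replacement has to be written again, which is why this copies to a second
file instead of editing in place.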

>Nope. If you try a[234] = 'banana' you'll get an error message. The mmap 
>protocol doesn't support insertion and deletion, only overwriting.
>
>Of course, it's far too complicated to actually *try* this stuff before 
>pontificating  [not]:
>
>  >>> import mmap
>  >>> f = file("/tmp/Xout.txt", "r+")
>  >>> mm = mmap.mmap(f.fileno(), 200)
>  >>> mm[1:10]
>'elcome to'
>  >>> mm[1] = "banana"
>Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>IndexError: mmap assignment must be single-character string
>  >>> mm[1:10] = 'ishing ::'
>  >>> mm[1:10]
>'ishing ::'
>  >>> mm[1:10] = 'a'
>Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>IndexError: mmap slice assignment is wrong size
>  >>>
>
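
The session quoted above is Python 2. A rough Python 3 equivalent, as a
sketch: file() becomes open(), the mapping holds bytes rather than str, and
the exact exception raised for a wrong-size assignment may differ between
versions, so the example catches both IndexError and ValueError. The helper
names overwrite_bytes and resize_is_rejected are invented for this example.

```python
import mmap
import os
import tempfile


def overwrite_bytes(path, start, new_bytes):
    """Overwrite len(new_bytes) bytes at offset `start`; file size is unchanged."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file (the file must not be empty).
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[start:start + len(new_bytes)] = new_bytes
            mm.flush()


def resize_is_rejected(path, start, stop, new_bytes):
    """Return True if assigning new_bytes to mm[start:stop] raises an error."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            try:
                mm[start:stop] = new_bytes
            except (IndexError, ValueError):
                return True
    return False


# Demonstration on a throwaway file.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"Welcome to the mmap demo\n")
os.close(fd)

overwrite_bytes(demo_path, 1, b"ishing ::")          # same length: allowed
with open(demo_path, "rb") as f:
    demo_result = f.read()

rejected = resize_is_rejected(demo_path, 1, 10, b"a")  # 1 byte into 9: refused
os.remove(demo_path)
```
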
>> flush() definitely makes updating less of an issue,  I'm just 
>> curious about the cost of writing small changes scattered all 
>> over the place back to the large file.
>> 
>Some of this depends on whether the mmap is shared or private, of 
>course, but generally speaking you can ignore the overhead, and the 
>flush() calls will be automatic as long as you don't mix file and string 
>operations. The programming convenience is amazing.
That part does look good, but will scanning a large file with mm.find()
cause massive swapouts, or is there some smart prioritization or
hidden sequential windowing that limits mmap's impact on memory?
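
To make the comparison concrete, here is a sketch in Python 3 of the two
approaches being weighed. It does not answer the paging question, which
depends on the operating system's virtual memory policy; it only shows that
an explicit sequential window keeps the process's own buffer bounded, while
the mmap version leaves residency decisions to the OS. The function names and
the chunk_size parameter are assumptions made for this example.

```python
import mmap
import os
import tempfile


def find_with_mmap(path, pattern):
    """Return the offset of the first match using mmap.find(), or -1."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(pattern)


def find_in_chunks(path, pattern, chunk_size=1 << 20):
    """Return the offset of the first match using sequential reads, or -1.

    At most chunk_size + len(pattern) - 1 bytes are held at any time.
    """
    if not pattern:
        raise ValueError("pattern must be a non-empty bytes object")
    overlap = len(pattern) - 1   # carried over so straddling matches are seen
    tail = b""
    base = 0                     # file offset of the first byte of the buffer
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return -1
            buf = tail + chunk
            idx = buf.find(pattern)
            if idx != -1:
                return base + idx
            tail = buf[-overlap:] if overlap else b""
            base += len(buf) - len(tail)


# Demonstration: the match straddles a chunk boundary when chunk_size=1001.
fd, big_path = tempfile.mkstemp()
os.write(fd, b"ACGT" * 1000 + b"GATTACA" + b"ACGT" * 1000)
os.close(fd)
mmap_hit = find_with_mmap(big_path, b"GATTACA")
chunk_hit = find_in_chunks(big_path, b"GATTACA", chunk_size=1001)
miss = find_in_chunks(big_path, b"NNNN", chunk_size=1001)
os.remove(big_path)
```
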
>
>> --
>> I have come to kick ass, chew bubble gum and do the following:
>> 
>> from __future__ import py3k
>> 
>> And it doesn't work.
>
>So make it work :-)
>

Regards,
Bengt Richter