file ops on the fly ?

Bengt Richter bokr at oz.net
Mon Jul 22 18:12:08 EDT 2002


On Sat, 20 Jul 2002 08:56:49 GMT, Alex Martelli <aleax at aleax.it> wrote:
[...]
>
>Some INCREDIBLY clever buffering algorithm might perhaps be
>able to DOUBLE the speed of your desired in-place-editing
>approach depending on your patterns of deletions and
>insertions, by using (not wasting:-) memory most cleverly.
>So you'd only slow down by say 500 times instead of by 1000.
>Do you think it would be worth the huge coding effort?
>
If Alex Martelli's first reaction is that the buffering would
have to be "INCREDIBLY" clever, that's saying something ;-)

But the job is essentially one of building a diff file (or whatever
in-memory representation of one is convenient) incrementally and
presenting an updated view of a file window on demand, I would
think. Manually entered changes will certainly fit in memory,
assuming you represent inserted files (if those are required)
by reference and substring parameters rather than literally.

If you think of the current state of the file as a chunk list,
with chunks either being strings or file slices, it should be easy
to support read, and the same operation that identifies the chunk-list
slice spanning a read window would be required when you want
to do a write. The write would just modify or delete the first chunk,
delete all intermediate chunks up to the last, and modify the
last chunk to represent a remaining fragment as necessary.
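
To make that concrete, here's a minimal sketch of the chunk-list idea
(hypothetical names, Python 2 like the rest of this post; the real
VMString below differs in detail). A chunk is either a literal string
or a (path, start, end) file slice, and a read window is served by
walking the list:

 def chunk_len(chunk):
     # length of a chunk, whether literal string or (path, start, end) slice
     if isinstance(chunk, str):
         return len(chunk)
     path, start, end = chunk
     return end - start

 def chunk_str(chunk):
     # realize a chunk as an actual string
     if isinstance(chunk, str):
         return chunk
     path, start, end = chunk
     f = open(path, 'rb')
     f.seek(start)
     data = f.read(end - start)
     f.close()
     return data

 def read_window(chunks, pos, size):
     # return up to size bytes starting at virtual offset pos
     out = []
     for chunk in chunks:
         n = chunk_len(chunk)
         if pos >= n:                  # window starts past this chunk
             pos = pos - n
             continue
         piece = chunk_str(chunk)[pos:pos + size]
         out.append(piece)
         size = size - len(piece)
         pos = 0
         if size <= 0:
             break
     return ''.join(out)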

If you made a class to do this, you could have it do a sequential
merge/update at the end of a session. The question is what a
convenient API for the editor would be beyond read/write/seek
on a virtual file object: you might need something for lazy
file insertions, depending on the editor's requirements; you might
want to see a line-sequence window rather than a byte-stream window;
and you may want special undo support in the class, etc.
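
If it helps to see the shape, a skeleton might be (again hypothetical,
building on the sketch above; the slice-splicing that write needs is
sketched further down):

 import os

 class VirtualFile:
     # sketch of the API discussed above, not the actual VMString
     def __init__(self, path):
         self.chunks = [(path, 0, os.path.getsize(path))]  # one big file slice
         self.pos = 0
     def seek(self, offset, whence=0):
         if whence == 1:
             offset = self.pos + offset    # relative seek
         self.pos = offset
     def read(self, size):
         data = read_window(self.chunks, self.pos, size)
         self.pos = self.pos + len(data)
         return data
     def write(self, data):
         # splice data over [pos:pos+len(data)] in the chunk list
         assign_slice(self.chunks, self.pos, self.pos + len(data), [data])
         self.pos = self.pos + len(data)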

Updating in place, even sequentially at the end, poses a risk though,
so I'd rather have the disk space to merge to a new temp file and
rename old->old.bak, rename temp->old, and maybe delete old.bak at the
end (sketched below).
If that requires a new disk, it's definitely time to do as you say:
>
>Buy a new disk.  They're cheap, VERY cheap.
>
That's true, but if the OP has an older system, it may be a pain
to get past 8GB, or even lower values. Of course that's a good
excuse for a whole new system (though reading EULAs is emotionally
challenging 8^/ ).
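
For the record, the merge-and-rename dance itself is only a few lines
once the merged temp file has been written (a sketch; file names are
just examples):

 import os

 def safe_replace(old, tmp, backup=1):
     # tmp already holds the merged result; swap it into place
     bak = old + '.bak'
     if os.path.exists(bak):
         os.remove(bak)            # clear any stale backup
     os.rename(old, bak)           # rename old -> old.bak
     os.rename(tmp, old)           # rename temp -> old
     if not backup:
         os.remove(bak)            # and maybe delete old.bak at the end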

>
>Or, change your data's structure.  A flat text file does
>not really make much sense when dealing with gigabytes,
>particularly for data that needs to be updated in-place
>(conceptually) such as yours.  *INVEST* (not a waste!-)
>some disk space into structuring your data in a way that
>is a closer match for your application's needs...!

Maybe a simple RDB of lines with suitable indexing?
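
Even something as crude as a dbm file keyed by line number would buy
random access and in-place line updates. A sketch (file names and
engine choice arbitrary):

 import anydbm

 db = anydbm.open('lines.db', 'c')
 f = open('big.txt')
 i = 0
 for line in f:                    # key each line by its line number
     db['%010d' % i] = line
     i = i + 1
 f.close()
 db['%010d' % 42] = 'replacement for line 42\n'   # update one line in place
 db.close()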

[ a weekend later ]

I decided to try something along the lines of the above.

Actually, I made a second pass, based on a mutable string
represented as chunks above (strings or file slice specs),
and supporting __getitem__,__setitem__,__delitem__ with
slices, and hacked read/write/seek using those. Here's
an example, peeking at the internal chunk list as we go:

 >>> from VMString import VMString as VMS
 >>> vs = VMS(path='test.txt')
You can also instantiate other ways; see further down.

Here we induce a call to str(vs):
 >>> print '-------\n%s-------' % vs
 -------
 Digits @[42:52], then alphabet @[54:80]:
 0123456789
 abcdefghijklmnopqrstuvwxyz 
 
 Last line, after a blank one.
 -------

Here we do the same for slices, which evaluate to new VMString instances:
 >>> print '%s %s' % (vs[42:52],vs[54:80])
 0123456789 abcdefghijklmnopqrstuvwxyz

There is still only a single chunk in vs, representing the file.
 >>> vs.chunks
 [<test.txt[0:115]>]

Now we'll modify it and have a look:
 >>> vs[59:59]='XXX'
 >>> vs.chunks
 [<test.txt[0:59]>, 'XXX', <test.txt[59:115]>]

Notice the old chunk was split in two at the destination slice.
Now we'll go past the first X and replace 10 characters:
 >>> vs[60:70]='YYY'
 >>> vs.chunks
 [<test.txt[0:59]>, 'X', 'YYY', <test.txt[67:115]>]

Notice that the replacement bit into the last file slice.
Now we'll replace 5 characters starting one before the X,
which should kill the YYY:
 >>> vs[58:63]='ZZZ'
 >>> vs.chunks
 [<test.txt[0:58]>, 'ZZZ', <test.txt[67:115]>]

QED. Printing the whole again:
 >>> print '-------\n%s-------' % vs
 -------
 Digits @[42:52], then alphabet @[54:80]:
 0123456789
 abcdZZZnopqrstuvwxyz

 Last line, after a blank one.
 -------
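
The splitting rule those transcripts illustrate is small enough to
sketch (a hypothetical reconstruction using chunk_len from the earlier
sketch, not the actual VMString code). Assigning a list of replacement
chunks over virtual bytes [i:j]:

 def cut(chunk, a, b):
     # sub-chunk [a:b] of a chunk, string or file slice
     if isinstance(chunk, str):
         return chunk[a:b]
     path, start, end = chunk
     return (path, start + a, start + b)

 def assign_slice(chunks, i, j, replacement):
     # replace virtual bytes [i:j] with the chunks in replacement
     new = []
     pos = 0
     inserted = 0
     for chunk in chunks:
         n = chunk_len(chunk)
         lo, hi = pos, pos + n
         pos = hi
         if hi <= i:                   # wholly before the slice: keep
             new.append(chunk)
             continue
         if not inserted:
             if lo < i:                # head fragment of first touched chunk
                 new.append(cut(chunk, 0, i - lo))
             new.extend(replacement)   # splice in the new chunks
             inserted = 1
         if hi > j:
             if lo < j:                # tail fragment of last touched chunk
                 new.append(cut(chunk, j - lo, n))
             else:
                 new.append(chunk)     # wholly after the slice: keep
     if not inserted:
         new.extend(replacement)       # slice began at or past the end
     chunks[:] = new

Deletion is then just assign_slice(chunks, i, j, []), and assigning a
string s is assign_slice(chunks, i, j, [s]).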

Now I happen to have a file called test2.txt with just "<Test2.txt>" in it (no \n).
We'll replace the 'pqr' after the ZZZ with it:

 >>> str(vs[63:66])
 'pqr'
Yup, that's where pqr is. Here goes:
 >>> vs[63:66] = VMS(path='test2.txt')
 >>> vs.chunks
 [<test.txt[0:58]>, 'ZZZ', <test.txt[67:69]>, <test2.txt[0:11]>, <test.txt[72:115]>]

Note what happened to the chunks (there are now two different files involved).
Actually, such small file slices should just be converted to actual strings, with
some threshold that leaves only really large chunks as file-slice specs (a sketch
follows after the printout). The result of the above is:

 >>> print '-------\n%s-------' % vs
 -------
 Digits @[42:52], then alphabet @[54:80]:
 0123456789
 abcdZZZno<Test2.txt>stuvwxyz

 Last line, after a blank one.
 -------
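
That thresholding would be one pass over the list; a sketch (THRESHOLD
value arbitrary; chunk_len and chunk_str as in the earlier sketch),
coalescing adjacent strings while it's at it:

 THRESHOLD = 64    # file slices smaller than this become literal strings

 def normalize(chunks):
     new = []
     for chunk in chunks:
         if not isinstance(chunk, str) and chunk_len(chunk) < THRESHOLD:
             chunk = chunk_str(chunk)          # small slice -> string
         if new and isinstance(new[-1], str) and isinstance(chunk, str):
             new[-1] = new[-1] + chunk         # merge adjacent strings
         else:
             new.append(chunk)
     chunks[:] = new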

Oh, there is a file interface. We know ZZZ starts at 58, so let's
try 10 characters from what ought to be the beginning of the line at 54:
 >>> vs.seek(54)
 >>> vs.read(10)
 'abcdZZZno<'

Seems to have worked. Now we'll skip 6 characters (which should be 'Test2.')
and write a new extension 'foo':
 >>> vs.seek(6,1)
 >>> vs.write('foo')

Check the result:
 >>> print '-------\n%s-------' % vs
 -------
 Digits @[42:52], then alphabet @[54:80]:
 0123456789
 abcdZZZno<Test2.foo>stuvwxyz

 Last line, after a blank one.
 -------

Notice what happened to the chunks. We split the test2 slice by putting 'foo' inside it:
 >>> vs.chunks
 [<test.txt[0:58]>, 'ZZZ', <test.txt[67:69]>, <test2.txt[0:7]>, 'foo', <test2.txt[10:11]>,
 <test.txt[72:115]>]

Oops, I almost forgot about del. Let's kill everything from just after the ZZZno
up to the final 'one.'
Verify our guess that 58+len('ZZZno') is the place to start, knowing the file ends with \r\n:
 >>> str(vs[63:-6])
 '<Test2.foo>stuvwxyz\r\n\r\nLast line, after a blank '

Looks ok, now delete that stretch and view result:
 >>> del vs[63:-6]
 >>> print '-------\n%s-------' % vs
 -------
 Digits @[42:52], then alphabet @[54:80]:
 0123456789
 abcdZZZnoone.
 -------

And view the chunks:
 >>> vs.chunks
 [<test.txt[0:58]>, 'ZZZ', <test.txt[67:69]>, <test.txt[109:115]>]

Look at their string representations:
 >>> map(str,vs.chunks)
 ['Digits @[42:52], then alphabet @[54:80]:\r\n0123456789\r\nabcd', 'ZZZ', 'no', 'one.\r\n']

Well, this is a kind of functional proof of concept. It's not very efficient right now ;-)
But I think you can see how a true compulsive could work out a way to look through the
chunk list, sort the file slices by file, and see what was needed to do the rewriting
"in place" with minimized side buffers (very likely holdable in memory at today's sizes).
The cleverness probably wouldn't be totally INCREDIBLE, but it _would_ be a fair amount
of work for mostly nothing. It's not the old days any more, when you had to do such things ;-)

Plus, a failure in the middle would leave you in deep doodoo. So I made close()
accept a destination file name and a backup flag. It writes destfile.tmp and then
either renames or deletes destfile, and then renames destfile.tmp to destfile.
I should make it silent if there's no old file, though:

 >>> vs.close('test3.txt',0) # 0==no backup
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "VMString.py", line 364, in close
     os.remove(destpath)
 OSError: [Errno 2] No such file or directory: 'test3.txt'

Um, or at least catch the right exception ;-) (a one-line fix, sketched after the
printout). That means I didn't get to the renaming, but we can easily check test3.txt.tmp:

 >>> print '-------\n%s-------' % VMS(path='test3.txt.tmp')
 -------
 Digits @[42:52], then alphabet @[54:80]:
 0123456789
 abcdZZZnoone.
 -------
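
The silent-remove fix is just an errno check, so other failures still
surface (a sketch):

 import os, errno

 def remove_if_exists(path):
     try:
         os.remove(path)
     except OSError, e:
         if e.errno != errno.ENOENT:   # only swallow "no such file"
             raise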

BTW, close() copies file slices, but you can configure a constant to control
the maximum copy-buffer size, so a huge file can be copied without loading it into memory.
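
The bounded copy itself is just a read loop that never holds more than
one block; a sketch using that constant:

 BLOCKSIZE = 8192    # max bytes held in memory per copy step

 def copy_slice(dst, path, start, end):
     # append file slice [start:end] of path to the open file object dst
     src = open(path, 'rb')
     src.seek(start)
     left = end - start
     while left > 0:
         block = src.read(min(left, BLOCKSIZE))
         if not block:
             break                     # source shrank; stop early
         dst.write(block)
         left = left - len(block)
     src.close()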

I set it to 8192, but I can see a better way for the whole file: use vs.read(8192) to
consolidate chunks and write single large blocks on cluster boundaries (even if it
doesn't wind up reading on boundaries when it walks into large file slices).

I still have a bug or two, but it's close to working (having started thinking about it
when I saw your post). Right now it requires binary-mode read/write, but I think it
would be possible to support text cooking and line-oriented I/O, which might be nice
support for building an editor.
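
For instance, a crude readline could sit directly on top of read()
(byte-at-a-time, so strictly a sketch, not how you'd really do it):

 def readline(vf):
     # line-oriented read built on the virtual file's read()
     pieces = []
     while 1:
         c = vf.read(1)
         if not c:                 # end of virtual file
             break
         pieces.append(c)
         if c == '\n':
             break
     return ''.join(pieces)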

I did mention that you can init with a string as well, right?
 >>> import os
 >>> pi=VMS(os.popen(r'D:\python22\python C:\pywk\pi\meertens-pi-n.py 200').read())
 >>> print '%s' % pi[:50]
 31415926535897932384626433832795028841971693993751
 >>> print '%s' % pi[50:100]
 05820974944592307816406286208998628034825342117067

 >>> pi.chunks
 ['3141592653589793238462643383279502884197169399375105820974944592307816406286208998628034
 825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105
 5596446229489549303819']

And so forth ;-)

Regards,
Bengt Richter


