Persistent objects

Bengt Richter bokr at oz.net
Sun Dec 12 18:06:07 EST 2004


On 12 Dec 2004 01:54:28 -0800, Paul Rubin <http://phr.cx@NOSPAM.invalid> wrote:

>I've had this recurring half-baked desire for long enough that I
>thought I'd post about it, even though I don't have any concrete
>proposals and the whole idea is fraught with hazards.
>
>Basically I wish there was a way to have persistent in-memory objects
>in a Python app, maybe a multi-process one.  So you could have a
>persistent dictionary d, and if you say 
>   d[x] = Frob(foo=9, bar=23)
>that creates a Frob instance and stores it in d[x].  Then if you
>exit the app and restart it later, there'd be a way to bring d back
>into the process and have that Frob instance be there.
I've had similar thoughts. Various related topics have come up on clp.
I speculated on getting a fast start capability via putting the entire
heap etc state of python in an mmap-ed region and having a checkpoint
function you could call, sort of like a yield, that goes under the hood
and writes the state to a file. Then you would restart it via
python -resume savedstate and instead of loading normally, python would
load its state from savedstate and appear to continue from the "yield"
that caused the checkpointing.

Of course that has a lot of handwaving content, but the idea of image-wise
saving state is similar to what you want to do I think.

But I think you have to think about what id(Frob(foo=9, bar=23)) means,
because that is basically what you are passing to d (along with id(x) above).

For speed you really don't want d to be copying Frob immutable representations
from heap memory to mmap memory, you want Frob to be created in mmap memory
to start with (as I think you were saying). But this requires specifying that
Frob should behave that way, one way or another. If we get class decorators,
maybe we could write

    @persistent(mmapname)
    class Frob(object):
        ...

but in the meantime we could write

    class Frob(object):
        ...
    Frob = persistent(mmapname)(Frob)

The other way is to modify the Frob code to inherit from Persistent(mmapname)
or such. BTW, mmapname IMO should not name a file directly, or you get horrible
coupling of the code to a particular site. I think there should be a standard
place to store the mapping from names to files, similar to sys.modules, so that
mmapnames can be used as abstract mmap space specifiers. In fact, the mapping
could use sys.modules by being in a standard module for registering mmapnames,
backed by a persistent config file (or we get recursion ?;-)
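
A minimal sketch of the name registry and the decorator factory, to make
the indirection concrete. Everything here is hypothetical: register_mmapname,
resolve_mmapname, the _mmapname attribute and the "site_data" name are made up
for illustration, and persistent() only records the association rather than
implementing any storage.

```python
import os
import tempfile

# Hypothetical registry: abstract mmap-space name -> backing file path.
# In the scheme above this would be loaded from a site-specific config file.
_mmap_registry = {}

def register_mmapname(name, path):
    """Associate an abstract mmap-space name with a concrete file."""
    _mmap_registry[name] = path

def resolve_mmapname(name):
    """Return the backing file for a name; raises KeyError if unregistered."""
    return _mmap_registry[name]

def persistent(mmapname):
    """Decorator factory: tag a class with the mmap space it should live in.

    This sketch only records the name on the class; no storage is implemented.
    """
    def decorate(cls):
        cls._mmapname = mmapname
        return cls
    return decorate

class Frob(object):
    def __init__(self, foo, bar):
        self.foo, self.bar = foo, bar
Frob = persistent("site_data")(Frob)

# Only this line is site-specific; the class itself never names a file.
register_mmapname("site_data",
                  os.path.join(tempfile.gettempdir(), "site_data.mmap"))
```

The point of the split is that moving the data to another path only touches
the registration, not the code that defines or uses Frob.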

Thinking out loud here ...

So what could Frob = persistent(mmapname)(Frob) do with Frob to make Frob
objects persist and what will foo = Frob(...) mean?
foo is an ordinary name bound to the Frob instance.
But the binding here is between a transient name in the local name space
and a persistent Frob instance. So I think we need a Frob instance proxy
that implements an indirect reference to the persistent data using a persistent
data id, which could be an offset into the mmap file where the representation
is stored, analogous to id's being memory addresses. But the persistent representation
has to include type info also, which can't have RAM memory references in it, if it's
to be shared -- unless maybe if you do extreme magic for sharing, like debugger code
and relocating loaders etc.
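
To illustrate the proxy idea, here is a rough sketch under several
assumptions of my own: an anonymous mmap stands in for the named mmap file,
records are append-only, the type is stored as a name string (no RAM
pointers), and attributes are serialized as JSON purely for brevity. Region,
Proxy and the record layout are all hypothetical.

```python
import json
import mmap
import struct

HEADER = struct.Struct("<HI")   # type-name length, payload length

class Region(object):
    """Append-only record area inside an mmap (anonymous here, for the demo)."""

    def __init__(self, size=4096):
        self.buf = mmap.mmap(-1, size)
        self.top = 0                      # next free offset

    def store(self, type_name, attrs):
        """Write a record and return its offset, which acts as the persistent id."""
        name = type_name.encode("utf-8")
        payload = json.dumps(attrs).encode("utf-8")
        offset = self.top
        start = offset + HEADER.size
        end = start + len(name) + len(payload)
        if end > len(self.buf):
            raise MemoryError("region full")
        HEADER.pack_into(self.buf, offset, len(name), len(payload))
        self.buf[start:end] = name + payload
        self.top = end
        return offset

    def load(self, offset):
        """Read back (type_name, attrs) for the record at offset."""
        nlen, plen = HEADER.unpack_from(self.buf, offset)
        start = offset + HEADER.size
        name = self.buf[start:start + nlen].decode("utf-8")
        payload = self.buf[start + nlen:start + nlen + plen]
        return name, json.loads(payload.decode("utf-8"))

class Proxy(object):
    """Transient in-RAM object that holds only (region, offset)."""

    def __init__(self, region, offset):
        self._region, self._offset = region, offset

    def __getattr__(self, attr):
        # Called only when normal lookup fails, so _region/_offset are safe.
        _, attrs = self._region.load(self._offset)
        try:
            return attrs[attr]
        except KeyError:
            raise AttributeError(attr)

region = Region()
pid = region.store("Frob", {"foo": 9, "bar": 23})   # pid is an offset, not id()
foo = Proxy(region, pid)                            # transient name -> proxy
```

Here foo.foo is read out of the mapped region on every access; a real
version would need the in-RAM caching discussed further down.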

Now if you wrote

    @persistent(mmapname)
    class PD(dict): pass
    ...
    d = PD()
    d[x] = frobinst

then if persistent was smart enough to recognize some useful basic types like dict,
then d would be a transient binding to a persistent dict proxy which could recognize
persistent object proxies as values and just get the persistent id and use that instead
of creating a new persistent representation, or if the value was a reference to an ordinary
immutable, a persistent copy could be made in mmap space, and the offset/id of _that_ would
be used as the value ref in the persistent representation of d. Similarly with the key.

d.__setitem__(key, value) would not accept a reference to an ordinary mutable value object
unless maybe it had a single reference count, indicating that it was only constructed to
pass as an argument. In that case, if persistent(mmapname)(type(themutable)) succeeded, then
a representation could be created in the mmapname space and the mmap offset/id could be
used as the value ref in the d hash association with the key done again similarly.
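
The acceptance rule for values might be sketched like this. It is only an
illustration: PersistentRef, IMMUTABLE_TYPES and value_ref are hypothetical
names, a plain list stands in for mmap space, and the single-reference-count
exception for freshly constructed mutables is left out.

```python
class PersistentRef(object):
    """Stand-in for a persistent object proxy; holds only an offset/id."""
    def __init__(self, offset):
        self.offset = offset

# Assumed set of "ordinary immutables" that are safe to copy by value.
IMMUTABLE_TYPES = (str, bytes, int, float, type(None))

def value_ref(value, copy_into_mmap):
    """Return the offset/id to record in the persistent dict for value."""
    if isinstance(value, PersistentRef):
        return value.offset              # already persistent: reuse its id
    if isinstance(value, IMMUTABLE_TYPES):
        return copy_into_mmap(value)     # make a persistent copy, use its id
    raise TypeError("cannot store ordinary mutable %s" % type(value).__name__)

# Demo: a list stands in for mmap space; the index plays the role of the offset.
space = []

def copy_into_space(value):
    space.append(value)
    return len(space) - 1

refs = [value_ref(PersistentRef(128), copy_into_space),
        value_ref("hello", copy_into_space)]
```

The same function would be applied to the key as well as the value.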

I feel like this is doable, if you don't get too ambitious to start ;-)
The tricky parts will be getting performance with proxies checking in-RAM cached
representations vs in-mmap-RAM representations, and designing representations to make
that happen.

If it's worth it ;-) Don't good OS file systems already have LRU caching of hot info,
so how much is there to gain over a lightweight database's performance?


>
>Please don't suggest using a pickle or shelve; I know about those
>already.  I'm after something higher-performance.  Basically d would
>live in a region of memory that could be mmap'd to a disk file as well
>as shared with other processes.  Once d was rooted into that region,
>any entries created in it would also be in that region, and any
>objects assigned to the entries would also get moved to that region.
UIAM heap objects would be hard to move unless they had ref counts of 1
-- and that only if ref count of 1 was implemented to identify the
referrer. Or totally rework garbage collection etc. And as mentioned,
direct references from the mmap region to ordinary RAM locations wouldn't
fly, since the latter are not persistent, but can't be moved unless other
references are updated. For checkpointing it would be different, because
it's not sharing.

>
>There'd probably have to be a way to lock the region for update, using
>semaphores.  Ordinary subscript assignments would lock automatically,
>but there might be times when you want to update several structures in
>a single transaction.
Definitely there would have to be a mutex, and one that could be accessed
by name between programs.
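
One portable way to get a mutex that unrelated programs can find by name is
a lock file created with O_CREAT | O_EXCL. The sketch below assumes that
approach; named_lock and its parameters are hypothetical, and a crashed
holder would leave a stale lock file behind, which this does not handle.

```python
import os
import tempfile
import time
from contextlib import contextmanager

@contextmanager
def named_lock(name, timeout=5.0, poll=0.01):
    """Hold an inter-process lock identified by name for the with-block."""
    path = os.path.join(tempfile.gettempdir(), name + ".lock")
    deadline = time.time() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if another process holds the lock.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.time() >= deadline:
                raise TimeoutError("could not acquire lock %r" % name)
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)       # releasing the lock = removing the file

# Several updates grouped into one transaction under a single lock.
counters = {"logins": 0, "posts": 0}
with named_lock("site_data_demo"):
    counters["logins"] += 1
    counters["posts"] += 1
```

Ordinary subscript assignment on the persistent dict could take the same
lock internally, with the explicit form reserved for multi-structure updates.
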
>
>A thing like this could save a heck of a lot of SQL traffic in a busy
>server app.  There are all kinds of bogus limitations you see on web
>sites, where you can't see more than 10 items per html page or
>whatever, because they didn't want loading a page to cause too many
>database hits.  With the in-memory approach, all that data could be
>right there in the process, no TCP messages needed and no context
>switches needed, just ordinary in-memory dictionary references.  Lots
>of machines now have multi-GB of physical memory which is enough to
>hold all the stuff from all but the largest sites.  A site like
>Slashdot, for example, might get 100,000 logins and 10,000 message
>posts per day.  At a 1k bytes per login (way too much) and 10k bytes
>per message post (also way too much), that's still just 200 megabytes
>for a full day of activity.  Even a low-end laptop these days comes
>with more ram than that, and multi-GB workstations are no big deal any
>more.  Occasionally someone might look at a several-day-old thread and
>that might cause some disk traffic, but even that can be left in
>memory (the paging system can handle it).
OTOH I think the danger of premature optimization is ever present. What info
do you have re actual causes of overhead? And are you looking at mostly
read-only or a lot of r/w activity?

If there is a use for this, do you really need the generality of Frob,
or would a d[x]=y that only allowed x and y as strings, but was fast,
be useful? I think the latter would not be that hard to implement. Basically
a string repository plus some persistent representation of a hash table
associating key strings with value strings, and locking provisions.
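
A very rough sketch of what the strings-only version could look like,
assuming fixed-size slots, linear probing, no deletion and no locking.
StringTable, the slot sizes and the file name are all made up for the
example; a real one would want a growable string repository instead of
fixed slots.

```python
import mmap
import os
import struct
import tempfile
import zlib

SLOTS, KMAX, VMAX = 64, 32, 64
HEAD = struct.Struct("<BHH")              # used flag, key length, value length
SLOT_SIZE = HEAD.size + KMAX + VMAX

class StringTable(object):
    """Fixed-size str -> str hash table stored in an mmap'd file."""

    def __init__(self, path):
        size = SLOTS * SLOT_SIZE
        new = not os.path.exists(path) or os.path.getsize(path) != size
        self._file = open(path, "w+b" if new else "r+b")
        if new:
            self._file.write(b"\0" * size)    # all slots start unused
            self._file.flush()
        self._map = mmap.mmap(self._file.fileno(), size)

    def _probe(self, key):
        """Yield slot offsets starting at the key's home slot (linear probing)."""
        # crc32 is stable across processes, unlike the built-in hash() of str.
        start = zlib.crc32(key) % SLOTS
        for i in range(SLOTS):
            yield ((start + i) % SLOTS) * SLOT_SIZE

    def __setitem__(self, key, value):
        k, v = key.encode("utf-8"), value.encode("utf-8")
        if len(k) > KMAX or len(v) > VMAX:
            raise ValueError("key or value too long for a slot")
        for off in self._probe(k):
            used, klen, _ = HEAD.unpack_from(self._map, off)
            body = off + HEAD.size
            if not used or self._map[body:body + klen] == k:
                HEAD.pack_into(self._map, off, 1, len(k), len(v))
                self._map[body:body + len(k)] = k
                self._map[body + KMAX:body + KMAX + len(v)] = v
                return
        raise MemoryError("table full")

    def __getitem__(self, key):
        k = key.encode("utf-8")
        for off in self._probe(k):
            used, klen, vlen = HEAD.unpack_from(self._map, off)
            if not used:
                break                         # empty slot ends the probe chain
            body = off + HEAD.size
            if self._map[body:body + klen] == k:
                return self._map[body + KMAX:body + KMAX + vlen].decode("utf-8")
        raise KeyError(key)

    def close(self):
        self._map.flush()
        self._map.close()
        self._file.close()

path = os.path.join(tempfile.gettempdir(), "strtable_demo.mmap")
d = StringTable(path)
d["user:42"] = "last login 2004-12-12"
d.close()

d = StringTable(path)                         # simulate a restart
restored = d["user:42"]
d.close()
```

After the second open, the value comes straight from the mapped file with
no unpickling step.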

>
>On the other hand, there'd either have to be interpreter hair to
>separate the persistent objects from the non-persistent ones, or else
>make everything persistent and then have some way to keep processes
>sharing memory from stepping on each other.  Maybe the abstraction
>machinery in PyPy can make this easy.
>
>Well, as you can see, this idea leaves a lot of details not yet
>thought out.  But it's alluring enough that I thought I'd ask if
>anyone else sees something to pursue here.

The strings-only version would let you build various pickling on top of
that for other objects, and there wouldn't be so much re-inventing to do ;-)
That seems like an evening's project, to get a prototype. But no time now...

Regards,
Bengt Richter


