Support for new items in set type

Prateek surekap at gmail.com
Sun Apr 22 06:50:45 EDT 2007


> > 2) Maintaining a copy wastes memory
> > 3) I don't have a good solution if I delete items from the set
> > (calculating the difference will return an empty set but I need to
> > actually delete stuff).
>
> (3) is easy -- the difference originalset-finalset is the set of things
> you have to delete.

Now why didn't I think of that?
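
Spelling it out with a toy example (the values here are made up by me,
just to convince myself):

    original = set(["a", "b", "c"])   # as loaded from the file
    final = set(["a", "c", "d"])      # after this session's edits

    to_delete = original - final      # set(["b"]) -> remove from disk
    to_add = final - original         # set(["d"]) -> append to disk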

> > Does anyone have any ideas?
>
> Your problem is really a thinly disguised version of one of the most
> ancient and venerable ones in data processing -- the "master file /
> transaction file" problem.  With sets (which are built as hash tables)
> you have very powerful weapons for this fray (it used to be, many years
> ago, that sorting and "in-step processing of sorted files" was the only
> feasible approach to such problems), but you still have to wield them
> properly.
>
> In your shoes, I would write a class whose instances hold three sets:
> -- the "master set" is what you originally read from the file
> -- the "added set" is the set of things you've added since then
> -- the "deleted set" is the set of things you've deleted since them
>
> You can implement a set-like interface; for example, your .add method
> will need to remove the item from the deleted set (if it was there),
> then add it to the added set (if it wasn't already in the master set);
> and so on, for each and every method you actually desire (including no
> doubt __contains__ for membership testing).  E.g., with suitably obvious
> names for the three sets, we could have...:
>
> class cleverset(object):
>    ...
>    def __contains__(self, item):
>       if item in self.deleted: return False
>       elif item in self.added: return True
>       else: return item in self.master
>
> When the time comes to save back to disk, you'll perform deletions and
> additions based on self.deleted and self.added, of course.

Ok. This class sounds like a good idea. I'll try it.
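
Here's a first sketch of how I might fill in the rest, based on your
outline (everything beyond __contains__ is my guess at the interface,
and save() takes hypothetical callbacks into our storage layer):

    class cleverset(object):
        def __init__(self, master):
            self.master = set(master)  # as originally read from the file
            self.added = set()
            self.deleted = set()

        def add(self, item):
            self.deleted.discard(item)
            if item not in self.master:
                self.added.add(item)

        def discard(self, item):
            # behaves like set.discard (no KeyError) for simplicity
            self.added.discard(item)
            if item in self.master:
                self.deleted.add(item)

        def __contains__(self, item):
            if item in self.deleted: return False
            elif item in self.added: return True
            else: return item in self.master

        def save(self, delete_on_disk, append_on_disk):
            # delete_on_disk/append_on_disk are stand-ins for our
            # storage layer; each takes a set of items
            delete_on_disk(self.deleted)
            append_on_disk(self.added)
            self.master = (self.master - self.deleted) | self.added
            self.added.clear()
            self.deleted.clear()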

> I'm not addressing the issue of how to best represent such sets on disk
> because it's not obvious to me what operations you need to perform on
> the on-disk representation -- for example, deletions can be very costly
> unless you can afford to simply set a "logically deleted" flag on
> deleted items; but, depending on how often you delete stuff, that will
> eventually degrade performance due to fragmentation, and maintaining the
> index (if you need one) can be a bear.

I actually don't have very many transactions I need to perform on
disk. Once I've loaded these sets, I normally use the contains
operation and union/intersection/difference operations.
I can definitely use a tombstone on the record to mark it as logically
deleted. The problem right now is that I don't need indexing within
sets (I either load the whole set or don't bother at all) - I'm only
indexing the entire file (one entry in the index per tablespace). So
even if I use tombstones, I'll still have to go through the whole
tablespace to find the tombstoned entries. Alternatively, I could write
the deleted data at the end of the tablespace and mark it with the
tombstone (since I don't care about space so much), and when I load the
data, I can eliminate the tombstoned entries. I'll give that a try and
report my performance results on this thread - I absolutely love that
I can make a change this important in one or two days using Python!
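
For anyone following along, the load-time filtering I have in mind
looks roughly like this (the record format is invented here for
illustration; the real file layout is different):

    def load_set(records):
        # records: (key, is_tombstone) pairs from one tablespace,
        # in write order; deletions are appended at the end with
        # the tombstone flag set
        live = set()
        for key, is_tombstone in records:
            if is_tombstone:
                live.discard(key)  # tombstone cancels an earlier entry
            else:
                live.add(key)
        return live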

> Normally, I would use a good relational database (which has no doubt
> already solved on my behalf all of these issues) for purpose of
> persistent storage of such data -- usually sqlite for reasonably small
> amounts of data, PostgreSQL if I need enormous scalability and other
> "enterprise" features, but I hear other people are happy with many other
> engines for relational DBs, such as Oracle (used to be superb if you
> could afford full-time specialized db admins), IBM's DB2, SAP's database
> (now opensourced), Firebird, even MySQL (I've never, but NEVER, had any
> good experience with the latter, but I guess maybe it's only me, as
> otherwise it could hardly be so popular).  The main point here is that a
> relational DB's tables _should_ be already very well optimized for
> storing "sets", since those table ARE exactly nothing but sets of tuples
> (and a 1-item tuple is a tuple for all that:-).

Thanks Alex, but we're actually implementing a (non-relational)
database engine. This particular file is one of many (all kept in
sync by the engine) that I need to persist; some of the others use
other methods - including pickling - for persistence. I've already
solved most of our other problems, and according to hotshot this is
one of the next hurdles (the commit operation is responsible for 8%
of the total CPU time @ 3000 items and 3 files).
If you're interested, you can check it out at http://www.brainwavelive.com
or email me.

Prateek



