Support for new items in set type

Alex Martelli aleax at mac.com
Sun Apr 22 01:14:43 EDT 2007


Prateek <surekap at gmail.com> wrote:

> I have a bit of a specialized request.
> 
> I'm reading a table of strings (specifically fixed length 36 char
> uuids generated via uuid.uuid4() in the standard library) from a file
> and creating a set out of it.
> Then my program is free to make whatever modifications to this set.
> 
> When I go back to save this set, I'd like to be able to only save the
> new items. Currently I am creating a copy of the set as soon as I load
> it and then when I go back to save it, i'm calculating the difference
> and saving just the difference.
> 
> There are many problems with this approach so far:
> 1) Calculating the difference via the standard set implementation is
> not very scalable -> O(n) I presume

Yep, you do have to look at every item, so O(N) is an obvious lower
bound.

> 2) Maintaining a copy wastes memory
> 3) I don't have a good solution if I delete items from the set
> (calculating the difference will return an empty set but I need to
> actually delete stuff).

(3) is easy -- the difference originalset-finalset is the set of things
you have to delete.

> Does anyone have any ideas?

Your problem is really a thinly disguised version of one of the most
ancient and venerable ones in data processing -- the "master file /
transaction file" problem.  With sets (which are built as hash tables)
you have very powerful weapons for this fray (it used to be, many years
ago, that sorting and "in-step processing of sorted files" was the only
feasible approach to such problems), but you still have to wield them
properly.

In your shoes, I would write a class whose instances hold three sets:
-- the "master set" is what you originally read from the file
-- the "added set" is the set of things you've added since then
-- the "deleted set" is the set of things you've deleted since them

You can implement a set-like interface; for example, your .add method
will need to remove the item from the deleted set (if it was there),
then add it to the added set (if it wasn't already in the master set);
and so on, for each and every method you actually desire (including no
doubt __contains__ for membership testing).  E.g., with suitably obvious
names for the three sets, we could have...:

class cleverset(object):
   ...
   def __contains__(self, item):
      if item in self.deleted: return False
      elif item in self.added: return True
      else: return item in self.master

When the time comes to save back to disk, you'll perform deletions and
additions based on self.deleted and self.added, of course.

I'm not addressing the issue of how to best represent such sets on disk
because it's not obvious to me what operations you need to perform on
the on-disk representation -- for example, deletions can be very costly
unless you can afford to simply set a "logically deleted" flag on
deleted items; but, depending on how often you delete stuff, that will
eventually degrade performance due to fragmentation, and maintaining the
indexing (if you need one) can be a bear.

Normally, I would use a good relational database (which has no doubt
already solved on my behalf all of these issues) for purpose of
persistent storage of such data -- usually sqlite for reasonably small
amounts of data, PostgreSQL if I need enormous scalability and other
"enterprise" features, but I hear other people are happy with many other
engines for relational DBs, such as Oracle (used to be superb if you
could afford full-time specialized db admins), IBM's DB2, SAP's database
(now opensourced), Firebird, even MySQL (I've never, but NEVER, had any
good experience with the latter, but I guess maybe it's only me, as
otherwise it could hardly be so popular).  The main point here is that a
relational DB's tables _should_ be already very well optimized for
storing "sets", since those table ARE exactly nothing but sets of tuples
(and a 1-item tuple is a tuple for all that:-).


Alex



More information about the Python-list mailing list