sorting many arrays from one...

Alex Martelli aleax at aleax.it
Tue Jul 9 11:15:20 EDT 2002


Shagshag13 wrote:

> 
> "Alex Martelli" <aleax at aleax.it>
> 
>> I'm not sure how that would save you any storage at all.  Since the
>> strings DO need to be stored somewhere anyway, what's saved by using
>> their id's rather than just references to them, as Python usually does?
> 
> i should have write that i use the same string many times in many files,
> so i only use its id which is always shorter than my strings and save
> storage. And i was wondering if there was a better/efficient way to handle
> id-string mapping.

If you never need to flush the memory used for such strings, Python's
built-in intern function is an excellent solution.  Unfortunately,
interned strings DO become "immortal".
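(In today's Python the function lives in the sys module as sys.intern.
A minimal sketch of the effect -- the join calls are just there to build
equal strings at runtime, since equal literals may already be shared by
the compiler:)

```python
import sys

# Two equal strings built at runtime are normally distinct objects...
a = ''.join(['dis', 'establishment', 'arianism'])
b = ''.join(['dis', 'establishment', 'arianism'])
print(a == b, a is b)   # True False -- equal value, two objects (CPython)

# ...but interning maps equal values to one shared, immortal object.
a = sys.intern(a)
b = sys.intern(b)
print(a is b)           # True
```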

If the ability to flush the structure periodically (in order to recover
all storage) is indeed a must, then you can improve on the code you
posted only in performance-marginal aspects (e.g., avoiding the horrid
idea of type-checking -- but that won't speed things up), except by
using string references rather than arbitrary integers, which would
save some more memory AND time.

A reference to a string takes no more memory than a reference to an
int, after all.  As long as the string object proper is only stored
once, having N more references to it from various modules is no more
costly (normally LESS costly in all aspects) than having N references
to an int instead.

If you're comfortable reasoning in C terms...: anything you see in
Python as a variable, an item in a list, etc, is a PyObject* -- a
pointer to a PyObject.  The pointer itself takes 4 bytes on a typical
32-bit machine, whatever kind of object is involved.  Python does
not make IMPLICIT copies -- assignment, argument passing, etc, all
store and/or sling around REFERENCES, i.e., PyObject* things.  An
object stays alive as long as its reference count is > 0 (except
when it's in a garbage-loop that may be collected by GC, but that's
another issue that doesn't really matter here).
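(A small illustration of those reference semantics, using
sys.getrefcount -- which, note, counts one extra reference for its
own argument:)

```python
import sys

big = 'x' * 1_000_000          # one roughly megabyte string object
alias = big                    # assignment copies the REFERENCE only
print(alias is big)            # True -- same PyObject underneath

# At least 3 references exist here: big, alias, and the temporary
# one held by the getrefcount call itself.
print(sys.getrefcount(big) >= 3)   # True
```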

So, say you have several strings all with the same value, e.g.
'disestablishmentarianism', obtained at various times, from various
sources, in various modules.  If you do nothing special you'll
take up N*(24+x) bytes for the objects themselves (x is a small
extra overhead for each string object, say another 8 I think)
plus N*4 bytes for the references, say N*(24+8+4) = N*36 or a bit
more.  So, it's sure worthwhile to cut this down to N*4+K for
some fixed K -- ensuring the value 'disestablishmentarianism'
is stored only once and all references to that value refer to
the very same Python object.  But there's no reason to bring
_integers_ into the picture...!
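(You can check the per-object cost yourself with sys.getsizeof --
the exact figures above were for the CPython of the day; they vary
by version and build, but the shape of the argument doesn't:)

```python
import sys

s = 'disestablishmentarianism'
# The string object itself costs tens of bytes (header + characters);
# the exact number depends on the CPython version.
print(sys.getsizeof(s))

# A reference to it, e.g. a list item, is just one pointer each:
refs = [s] * 1000
# All 1000 references share the single string object.
print(all(r is s for r in refs))   # True
```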

class StringStore:
    def __init__(self): self.store = {}
    def add(self, s): return self.store.setdefault(s,s)
    def flush(self): self.store.clear()

There -- instantiate one of these, once, and just be sure to
pass all strings that need to be stored through the add
method of that one instance.  This will actually work just
as well for tuples of immutable objects, unicode strings,
complex numbers, and so on, but think of those as just nice
side effects...:-).

Do you see why this works, now...?  After ss=StringStore(),
any ss.add(somestring) returns EITHER the string argument,
OR another string equal to that one if one was already in
ss.store -- so, somestring = ss.add(somestring) either
leaves variable somestring alone, or rebinds it to refer to
an identical value that's already stored elsewhere.  No need
to think of integers or anything like that, right...?
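(For instance -- repeating the little class so this runs on its own,
and using join to build equal strings at runtime, since the compiler
may already share equal literals:)

```python
class StringStore:
    def __init__(self): self.store = {}
    def add(self, s): return self.store.setdefault(s, s)
    def flush(self): self.store.clear()

ss = StringStore()

# Two equal strings from "different sources":
a = ''.join(['spam', 'and', 'eggs'])
b = ''.join(['spam', 'and', 'eggs'])
print(a is b)        # False -- two objects, same value (CPython)

a = ss.add(a)        # first one in: stored and returned as-is
b = ss.add(b)        # already present: the STORED object comes back
print(a is b)        # True -- both names now share one object

ss.flush()           # empty the store; the strings become collectable
                     # as soon as all other references let go too
```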

The dict's method setdefault is just a shortcut to make this
tiny and greased-lightning fast, of course.  It has much
the same semantics as
    if s in self.store: return self.store[s]
    self.store[s] = s
    return s
but is much faster and slicker:-).
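(The key property of setdefault, spelled out:)

```python
d = {}
print(d.setdefault('k', 'first'))    # key absent: 'first' is stored AND returned
print(d.setdefault('k', 'second'))   # key present: existing 'first' wins
print(d)                             # {'k': 'first'}
```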


Alex




More information about the Python-list mailing list