[Tutor] For - if - else loop; print selective output

eryksun eryksun at gmail.com
Fri Oct 26 18:58:47 CEST 2012


Sorry, Saad; this is now *way* off topic...

On Thu, Oct 25, 2012 at 5:39 PM, Oscar Benjamin
<oscar.j.benjamin at gmail.com> wrote:
>
>> degrades if there are many collisions). On average, you can check if a
>> set/dict contains an item in constant time, i.e. O(1). The amortized
>> worst case is O(n).
>
> Why do you say "*amortized* worst case"? Is there an occasional worse
> than O(n) operation that is insignificant when amortised?
>
> At first I assumed that was a slip of the tongue

Yes, it was badly stated. I've only implemented "separate chaining"
hash tables, where each slot is a linked list or dynamic array.
Python uses an "open addressing" implementation. It seems to me that
with this implementation the worst-case lookup is O(n), where n is
the table size: you've removed all but 1 key, all the keys hash the
same, and the lookup has to probe through every "dummy key" left in
the table (I know this is insanely unrealistic). With separate
chaining, on the other hand, the worst-case lookup is O(k), where k
is the number of keys, because deleting a key simply shrinks its chain.
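
To make the contrast concrete, here's a toy separate-chaining table
(just my own sketch for illustration, nothing like CPython's actual
code). Removing a key shrinks its chain instead of leaving a dummy
behind, so even if every key collides a lookup only scans the live
keys:

    class ChainedSet(object):
        """Toy separate-chaining set: each slot holds a list of keys."""
        def __init__(self, nslots=8):
            self.slots = [[] for _ in range(nslots)]

        def _chain(self, key):
            return self.slots[hash(key) % len(self.slots)]

        def add(self, key):
            chain = self._chain(key)
            if key not in chain:
                chain.append(key)

        def remove(self, key):
            # shrinks the chain in place; no dummy entry is left behind
            self._chain(key).remove(key)

        def __contains__(self, key):
            # worst case scans one chain: O(k) in the number of live keys
            return key in self._chain(key)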

Here's a concrete example to give a feel for what the hash table looks
like as keys are added and removed. First, define some ctypes
structures to peek at the underlying table of a set, plus a
pathological class whose instances all hash to 37:

http://hg.python.org/cpython/file/70274d53c1dd/Include/setobject.h#l21

    from ctypes import *

    Set_MINSIZE = 8  # PySet_MINSIZE in Include/setobject.h

    class SetEntry(Structure):
        # mirrors the setentry struct: a cached hash plus a key pointer
        _fields_ = [
            ('hash', c_long),
            ('_key', c_void_p),
        ]

        @property
        def key(self):
            # unused slots have a NULL key; dummies point to '<dummy key>'
            if self._key:
                return cast(self._key, py_object)
            else:
                return 0

    class SetObject(Structure):
        # mirrors the PySetObject struct layout (CPython 2.x)
        _fields_ = [
            ('ob_refcnt', c_ssize_t),
            ('ob_type', py_object),
            ('fill', c_ssize_t),    # active + dummy entries
            ('used', c_ssize_t),    # active entries
            ('mask', c_ssize_t),    # table length - 1
            ('table', POINTER(SetEntry)),
            ('lookup', c_void_p),
            ('smalltable', SetEntry * Set_MINSIZE),
            ('hash', c_long),
            ('weakreflist', c_void_p),
        ]

        @classmethod
        def from_set(cls, s):
            if not isinstance(s, set):
                raise TypeError('must be a set')
            # overlay the structure on the set's memory (CPython-specific)
            self = cls.from_address(id(s))
            self._set = s  # keep a reference
            return self

        def __iter__(self):
            # walk every slot of the current table, used or not
            for entrynum in xrange(self.mask + 1):
                entry = self.table[entrynum]
                yield entry.key

    class C(object):
        # pathological class: every instance hashes to 37
        def __hash__(self):
            return 37

        def __repr__(self):
            return 'C %d' % id(self)


Add 20 of these pathological objects to a set and then remove all but
1 of them. The 32-slot table now has 19 dummy keys, 1 used slot, and
12 unused slots:

    >>> c_list = [C() for i in range(20)]
    >>> s1 = set(c_list)
    >>> s2 = s1.copy()
    >>> c_keep = c_list.pop(10)
    >>> for c in c_list: s1.remove(c)

    >>> so1 = SetObject.from_set(s1)
    >>> list(so1)
    [py_object('<dummy key>'), py_object('<dummy key>'),
    py_object('<dummy key>'), 0, py_object('<dummy key>'),
    py_object('<dummy key>'), py_object('<dummy key>'),
    py_object('<dummy key>'), py_object('<dummy key>'),
    py_object('<dummy key>'), py_object('<dummy key>'),
    py_object('<dummy key>'), 0, py_object('<dummy key>'),
    py_object('<dummy key>'), 0, 0, 0, py_object('<dummy key>'),
    py_object('<dummy key>'), 0, py_object(C 3072909996), 0, 0,
    0, 0, 0, py_object('<dummy key>'), py_object('<dummy key>'),
    py_object('<dummy key>'), 0, py_object('<dummy key>')]


Since all the objects hash the same, the number of dummy keys a lookup
has to skip before finding c_keep varies with the table's history. I
did a few timeit tests on large pathological sets, and a lookup of the
remaining key took anywhere from 4 to 25 times longer when the table
was left in this state.
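
If you want to reproduce that kind of measurement, something along
these lines should work (assuming the C class above is defined in
__main__; the 2000 is an arbitrary size, and the exact ratio varies
with the table history):

    import timeit

    setup = '\n'.join([
        'from __main__ import C',
        'c_list = [C() for i in range(2000)]',
        's = set(c_list)',
        'c_keep = c_list.pop(1000)',
        'for c in c_list: s.remove(c)',  # leaves ~1999 dummies behind
        'clean = set([c_keep])',         # fresh one-element set to compare
    ])

    print timeit.timeit('c_keep in s', setup=setup, number=100000)
    print timeit.timeit('c_keep in clean', setup=setup, number=100000)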

In contrast, using difference_update() resizes and cleans up the table,
because more than 1/5 of its slots are dummies (here it shrinks back
down to the 8-element "smalltable" embedded in the set object).

    >>> s2.difference_update(c_list)
    >>> so2 = SetObject.from_set(s2)
    >>> list(so2)
    [0, 0, 0, 0, 0, py_object(C 3072909996), 0, 0]
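
The struct fields show the cleanup directly; the dummy count is
fill - used. Going by the listings above, the values should come out
roughly like this:

    >>> so1.fill, so1.used, so1.mask + 1   # 19 dummies + 1 key, 32 slots
    (20, 1, 32)
    >>> so2.fill, so2.used, so2.mask + 1   # back to the 8-slot smalltable
    (1, 1, 8)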

Inserting keys can also trigger a resize if 3*fill > 2*(mask + 1). In
other words, it resizes when the table is more than 2/3 full, where
"fill" counts active plus dummy slots and mask+1 is the table length.
In this case, if the numbers 3 and 12 are added to the set, they hash
to the unused table slots 3 and 12 (small ints hash to themselves;
adding more C instances would just refill the dummy slots). This
increases "fill" to 22 and triggers a table resize. The table is
resized to the first power of 2 greater than 4*used == 4*3 == 12,
which is 16 here. (Pedantically, it's a power of 2 times
PySet_MINSIZE, which is 8.)

    >>> s1.add(3)
    >>> s1.add(12)
    >>> list(so1)
    [0, 0, 0, py_object(3), 0, py_object(C 3072967820), 0, 0, 0, 0,
    0, 0, py_object(12), 0, 0, 0]
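
As a rough sketch of that growth rule (a simplification of the logic
in setobject.c, ignoring the large-set case that uses 2*used instead
of 4*used):

    >>> def grown_size(used):
    ...     # smallest power of 2 (at least PySet_MINSIZE == 8) > 4*used
    ...     newsize = 8
    ...     while newsize <= 4 * used:
    ...         newsize <<= 1
    ...     return newsize
    ...
    >>> grown_size(3)
    16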

