[Cython] Hash-based vtables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Sat Jun 9 08:00:50 CEST 2012


On 06/09/2012 07:45 AM, Dag Sverre Seljebotn wrote:
> On 06/09/2012 03:21 AM, Robert Bradshaw wrote:
>> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn
>>> There's still the indirection through SEP 200 (extensibletype slots).
>>> We can
>>> get rid of that very easily by just making that table and the
>>> hash-vtable
>>> one and the same. (It could still either have interned string keys or ID
>>> keys depending on the least significant bit.)
>>
>> Or we can even forgo the interning for this table, and give up on
>> partitioning the space numerically and just use the dns-style
>> prefixing, e.g. "org.cython.X" belongs to us.
>
> Huh? Isn't that when you *need* interning? Do you plan on key-encoding
> those kind of strings into 64 bits?
>
> (I think it would usually be "method:foo:ii->d" (or my current
> preference is "method:foo:i4i8->f8"))

Well, I guess something like "org.cython.X" would happen often as well, 
in addition. Just put it all in the same table :-)

>
> Partitioning the space numerically you'd just hash the number; "SEP 260:
> We use id 0x70040001, which has lower-64-md5 0xfa454a...ULL".

The real use-case I see for this now is in having the PyArray_DATA etc. 
access pointers simply through compile-time constants the library can 
define on both ends. It could just do

PyCustomSlots_Lookup(obj->ob_type, 0x70040001, 0xfa45323...ULL)

specifically to get a function retrieving the data-pointer. 
PyArray_SHAPE would do

PyCustomSlots_Lookup(obj->ob_type, 0x70040002, 0xbbad423...ULL)

Also, I'd want PyExtensibleType_Object to have:

{
     ...
     PyPerfectTable *tp_perfect_table;
     Py_ssize_t tp_perfect_table_obj_offset;
}

i.e. we allow for getting quickly to a table on the object in addition 
to the one on the type.

Callbacks look up the one on the object first (before potentially 
checking for __call__ in the type); method-calling might ignore the one 
on the object.

Dag

>
>> There is value in the double indirection if this (or any of the other)
>> lookup tables are meant to be modified over time.
>
> This isn't impossible with a hash table either. You just need to
> reallocate a little more often than what would be the case with a
> regular hash table, but not dramatically so (you need to rehash whenever
> the element to insert hashes into a "large" bin, which are rather few).
>
> I want the table to have a pointer to it, so that you can atomically
> swap it out.
>
>>> To wrap up, I think this has grown in complexity beyond the "simple SEP
>>> spec". It's at the point where you don't really want to have several
>>> libraries implementing the same simple spec, but instead use the same
>>> implementation.
>>>
>>> But I think the advantages are simply too good to give up on.
>>>
>>> So I think a viable route forward is to forget the
>>> CEP/SEP/pre-PEP-approach
>>> for now (which only works for semi-complicated ideas with simple
>>> implementations) and instead simply work more directly on a library. It
>>> would need to have a couple of different use modes:
>>
>> I prefer an enhancement proposal with a spec over a library, even if
>> only a single library gets used in practice. I still think it's simple
>> enough. Basically, we have the "lookup spec" and then a CEP for
>> applying this to fast callable (agreeing on signatures, and what to do
>> with extern types) and extensible type slots.
>
> OK.
>
>>
>>> - A Python perfect-hasher for use when generating code, with only the a
>>> string interner based on CPython dicts and extensibletype metaclass as
>>> runtime dependencies (for use in Cython). This would only add some
>>> hundred
>>> source file lines...
>>>
>>> - A C implementation of the perfect hashing exposed through a
>>> PyPerfectHashTable_Ready(), for use in libraries written in C like
>>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C
>>> implementation of the perfect hashing.
>>>
>>> And on the distribution axis:
>>>
>>> - Small C header-style implementation of a string interner and the
>>> extensibletype metaclass, rendezvousing through sys.modules
>>>
>>> - As part of the rendezvous, one would always try to __import__ the
>>> *real*
>>> run-time library. So if it is available in sys.path it overrides
>>> anything
>>> bundled with other libraries. That would provide a way forward for
>>> GIL-less
>>> string interning, a Python-side library for working with these tables
>>> and
>>> inspecting them, etc.
>>
>> Hmm, that's an interesting idea. I think we don't actually need
>> interning, in which case the "full" library is only needed for
>> introspection.
>
> You don't believe the security concern is real then? Or do you want to
> pay the cost for a 160-bit SHA1 compare everywhere?
>
> I'd love to not do interning, but I see no way around it.
>
> BTW, a GIL-less interning library isn't rocket science. I ran khash.h
> through a preprocessor with
>
> KHASH_MAP_INIT_STR(str_to_entry, entry_t)
>
> and the result is 180 lines of code for the hash table. Then pythread.h
> provides the thread lock, another 50 lines for the interning logic
> (intern_literal, intern_heap_allocated, release_interned).
>
> It just seems a little redundant to ship such a thing in every
> Cython-generated file since we hold the GIL during module loading anyway.
>
> Dag
> _______________________________________________
> cython-devel mailing list
> cython-devel at python.org
> http://mail.python.org/mailman/listinfo/cython-devel



More information about the cython-devel mailing list