[Cython] Hash-based vtables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Sat Jun 9 08:02:07 CEST 2012


On 06/09/2012 08:00 AM, Dag Sverre Seljebotn wrote:
> On 06/09/2012 07:45 AM, Dag Sverre Seljebotn wrote:
>> On 06/09/2012 03:21 AM, Robert Bradshaw wrote:
>>> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn
>>>> There's still the indirection through SEP 200 (extensibletype slots).
>>>> We can
>>>> get rid of that very easily by just making that table and the
>>>> hash-vtable
>>>> one and the same. (It could still either have interned string keys
>>>> or ID
>>>> keys depending on the least significant bit.)
>>>
>>> Or we can even forgo the interning for this table, and give up on
>>> partitioning the space numerically and just use the dns-style
>>> prefixing, e.g. "org.cython.X" belongs to us.
>>
>> Huh? Isn't that when you *need* interning? Do you plan on key-encoding
>> those kind of strings into 64 bits?
>>
>> (I think it would usually be "method:foo:ii->d" (or my current
>> preference is "method:foo:i4i8->f8"))
>
> Well, I guess something like "org.cython.X" would happen often as well,
> in addition. Just put it all in the same table :-)
>
>>
>> Partitioning the space numerically you'd just hash the number; "SEP 260:
>> We use id 0x70040001, which has lower-64-md5 0xfa454a...ULL".
>
> The real use-case I see for this now is in having the PyArray_DATA etc.
> access pointers simply through compile-time constants the library can
> define on both ends. It could just do
>
> PyCustomSlots_Lookup(obj->ob_type, 0x70040001, 0xfa45323...ULL)
>
> specifically to get a function retrieving the data-pointer.
> PyArray_SHAPE would do
>
> PyCustomSlots_Lookup(obj->ob_type, 0x70040002, 0xbbad423...ULL)

Argh. I meant 0x70040002 | 1, of course ;-)

DS

>
> Also, I'd want PyExtensibleType_Object to have:
>
> {
> ...
> PyPerfectTable *tp_perfect_table;
> Py_ssize_t tp_perfect_table_obj_offset;
> }
>
> i.e. we allow for getting quickly to a table on the object in addition
> to the one on the type.
>
> Callbacks look up the one on the object first (before potentially
> checking for __call__ in the type); method-calling might ignore the one
> on the object.
>
> Dag
>
>>
>>> There is value in the double indirection if this (or any of the other)
>>> lookup tables are meant to be modified over time.
>>
>> This isn't impossible with a hash table either. You just need to
>> reallocate a little more often than what would be the case with a
>> regular hash table, but not dramatically so (you need to rehash whenever
>> the element to insert hashes into a "large" bin, which are rather few).
>>
>> I want the table to have a pointer to it, so that you can atomically
>> swap it out.
>>
>>>> To wrap up, I think this has grown in complexity beyond the "simple SEP
>>>> spec". It's at the point where you don't really want to have several
>>>> libraries implementing the same simple spec, but instead use the same
>>>> implementation.
>>>>
>>>> But I think the advantages are simply too good to give up on.
>>>>
>>>> So I think a viable route forward is to forget the
>>>> CEP/SEP/pre-PEP-approach
>>>> for now (which only works for semi-complicated ideas with simple
>>>> implementations) and instead simply work more directly on a library. It
>>>> would need to have a couple of different use modes:
>>>
>>> I prefer an enhancement proposal with a spec over a library, even if
>>> only a single library gets used in practice. I still think it's simple
>>> enough. Basically, we have the "lookup spec" and then a CEP for
>>> applying this to fast callable (agreeing on signatures, and what to do
>>> with extern types) and extensible type slots.
>>
>> OK.
>>
>>>
>>>> - A Python perfect-hasher for use when generating code, with only the a
>>>> string interner based on CPython dicts and extensibletype metaclass as
>>>> runtime dependencies (for use in Cython). This would only add some
>>>> hundred
>>>> source file lines...
>>>>
>>>> - A C implementation of the perfect hashing exposed through a
>>>> PyPerfectHashTable_Ready(), for use in libraries written in C like
>>>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C
>>>> implementation of the perfect hashing.
>>>>
>>>> And on the distribution axis:
>>>>
>>>> - Small C header-style implementation of a string interner and the
>>>> extensibletype metaclass, rendezvousing through sys.modules
>>>>
>>>> - As part of the rendezvous, one would always try to __import__ the
>>>> *real*
>>>> run-time library. So if it is available in sys.path it overrides
>>>> anything
>>>> bundled with other libraries. That would provide a way forward for
>>>> GIL-less
>>>> string interning, a Python-side library for working with these tables
>>>> and
>>>> inspecting them, etc.
>>>
>>> Hmm, that's an interesting idea. I think we don't actually need
>>> interning, in which case the "full" library is only needed for
>>> introspection.
>>
>> You don't believe the security concern is real then? Or do you want to
>> pay the cost for a 160-bit SHA1 compare everywhere?
>>
>> I'd love to not do interning, but I see no way around it.
>>
>> BTW, a GIL-less interning library isn't rocket science. I ran khash.h
>> through a preprocessor with
>>
>> KHASH_MAP_INIT_STR(str_to_entry, entry_t)
>>
>> and the result is 180 lines of code for the hash table. Then pythread.h
>> provides the thread lock, another 50 lines for the interning logic
>> (intern_literal, intern_heap_allocated, release_interned).
>>
>> It just seems a little redundant to ship such a thing in every
>> Cython-generated file since we hold the GIL during module loading anyway.
>>
>> Dag
>> _______________________________________________
>> cython-devel mailing list
>> cython-devel at python.org
>> http://mail.python.org/mailman/listinfo/cython-devel
>



More information about the cython-devel mailing list