[Cython] Hash-based vtables

Sun Jun 10 10:43:29 CEST 2012

On 06/10/2012 10:23 AM, Robert Bradshaw wrote:
> On Sun, Jun 10, 2012 at 1:00 AM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no>  wrote:
>> On 06/10/2012 09:34 AM, Robert Bradshaw wrote:
>>>
>>> On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn
>>> <d.s.seljebotn at astro.uio.no>    wrote:
>>>>
>>>>
>>>>
>>>> Robert Bradshaw<robertwb at gmail.com>    wrote:
>>>>
>>>>> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
>>>>> <d.s.seljebotn at astro.uio.no>    wrote:
>>>>>>
>>>>>> I'd love to not do interning, but I see no way around it.
>>>>>
>>>>>
>>>>> No, I want to use the lower 64 bits by default, but always have the
>>>>> top 96 bits around to allow using this mechanism in "secure" mode at a
>>>>> slight penalty. md5 is out because there are known collisions. (Yes,
>>>>> sha-1 may succumb sooner rather than later, theoretical weaknesses
>>>>> have been shown, so we could look to using something else (hopefully
>>>>> still shipped with Python).
>>>>
>>>>
>>>> But very few users are going to know about this. What's the odds that the
>>>> user who decide to trigger JIT-compilation with function signatures that
>>>> varies based on the input will know about the option and turn it on and also
>>>> recompile all his/her C extension modules?
>>>>
>>>> In practice, such an option would always stay at its default value. If we
>>>> leave it to secure by default and start teaching it to users from the
>>>> start...but that's a big price to pay.
>>>
>>>
>>> Yes, it's not ideal from this perspective.
>>>
>>>> And if you *do* want to run in secure mode, it will be a lot slower than
>>>> interning.
>>>
>>>
>>> Are you thinking that the 64-bit interned pointer would be used as the
>>> hash? In this case all hashtables would have to be constructed at
>>> runtime, which means it needs to be really, really cheap (well under a
>>> milisecond, I'm sure Sage has>1000 classes,>10000 methods it imports
>>> at startup). Also I'm not sure how the very-uneven distribution would
>>> play out for constructing perfect hastables (perhaps it won't hurt,
>>> there's likely to be long runs of consecutive values in some cases.
>>
>>
>> No, I'm thinking that callsites need both the 64-bit interned char* and the
>> 64-bit hash of the *contents*. They use the hash to figure out the position,
>> then compare by ID.
>
> Ah, I missed that bit. OK, yes, that could work well.

Ah, we've been talking past one another for some time then. OK, let's 
settle on that.

>
>> The hash is not stored in callees, it's discarded after figuring out the
>> table layout.
>>
>> (There was this idea that if the char* has least significant bit set, we'd
>> hash it directly rather than dereference it, but let's ignore that for now.)
>
> (For the purpose of this discussion, it's part of the "interning" step.)
>
>> I don't think under a millisecond is unfeasible to hash smallish tables --
>> we could put the pointer through a cheap hash to create more entropy (for
>> the perfect hashing, being able to select a hash function through the>>r is
>> important, so you can't just use the pointer directly -- but there are
>> functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/)
>
> Just a sec, we're not hashing pointers, but the full signature itself,
> right? For our hash function we need
>
> (1) Collision free on 64-bits (for non-malicious use).
> (2) Good distribution (including for short strings, which is harder to come by).
> (2b) Any small subset of bits should have property (2).
> (3) Ideally easy to reference (e.g. "md5" is better than "these 100
> lines of C code").
>
> Cheap runtime construction is still ideal, but much less of an issue
> if hashes (and perfect tables) can be constructed at compile time,
> which I think this scheme allows.

Yes, 64 bits of md5 then? ulib contains "100 lines of C code" for md5 
anyway, if one doesn't want to go through Python hashlib (I imagine e.g. 
hashlib might be unavailable somewhere as it relies on openssl and 
there's license war going on vs. gnutls and so on. And the md5 module is 
deprecated.).

>
>> That would save us a register and make the instructions shorter in some
>> places I guess...I think it's really miniscule, it's not like the effect of
>> load of a global variable. But if you like this approach I can benchmark
>> C-written hashtable creation and see.
>
> This will have value in and of itself (both the implementation and the
> benchmarks).

Will do (eventually, less spare time in coming week).

About signatures, a problem I see with following the C typing is that 
the signature "ill" wouldn't hash the same as "iii" on 32-bit Windows 
and "iqq" on 32-bit Linux, and so on. I think that would be really bad. 
"l" must be banished -- but then one might as well do "i4i8i8".

Designing a signature hash where you select between these at 
compile-time is perhaps doable but does generate a lot of code and makes 
everything complicated. I think we should just start off with hashing at 
module load time when sizes are known, and then work with heuristics 
and/or build system integration to improve on that afterwards.

Dag