[Python-Dev] RE: Unicode character name hashing

Tim Peters tim_one@email.msn.com
Sat, 15 Jul 2000 21:52:19 -0400


[Bill Tutt]
> I just had a rather unhappy epiphany this morning.
> F1, and f2 in ucnhash.c might not work on machines where
> sizeof(long) != 32 bits.

If "not work" means "may not return the same answer as when a long does have
exactly 32 bits", then yes, it's certain not to work.  Else I don't know --
I don't understand the (undocumented) postconditions (== what does "work"
mean, exactly?) for these functions.

If getting the same bits is what's important, f1 can be repaired by
inserting this new block:

    /* cut back to 32 bits */
    x &= 0xffffffffL;
    if (x & 0x80000000L) {
        /* if negative as a 32-bit thing, extend sign bit to full precision */
        x -= 0x80000000L;  /* subtract 2**32 in a portable way */
        x -= 0x80000000L;  /* by subtracting 2**31 twice */
    }

between the existing
    x ^= cch + 10;
and
    if (x == -1)

This assumes that negative numbers are represented in 2's-complement, but
should deliver the same bits in the end on any machine for which that's true
(I don't know of any Python platform for which it isn't).  The same should
work for f2 after replacing its negative literal with a 0x...L bit pattern
too.
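
To see the masking and sign-extension step in isolation, here is a minimal
sketch.  The helper name fold_to_32 is made up for illustration and is not
part of ucnhash.c, and "long long" is an assumption standing in for a long
wider than 32 bits, so the effect shows up even on a box where long is
exactly 32 bits.  Like the block above, it assumes 2's-complement.

```c
/* Hypothetical helper, not from ucnhash.c.  "long long" stands in for a
   long that is wider than 32 bits. */
static long long fold_to_32(long long x)
{
    x &= 0xffffffffLL;              /* cut back to 32 bits */
    if (x & 0x80000000LL) {
        /* negative as a 32-bit value: subtract 2**32 as 2**31 twice */
        x -= 0x80000000LL;
        x -= 0x80000000LL;
    }
    return x;
}
```

A value whose low 32 bits are all ones comes back as -1, which is what a
32-bit long would have held, so a later "if (x == -1)" test sees the same
thing on both kinds of machine.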

The assumption about 2's-comp, and the new "if" block, could be removed by
making these functions compute with and return unsigned longs instead.  I
don't know why they're using signed longs now (the bits produced are exactly
the same either way, up until the "%" operation, at which point C89 leaves
the result implementation-defined when a signed operand is negative).
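
A rough sketch of the unsigned approach, for the final reduction step only.
The name slot_for and its parameters are hypothetical; the real functions
may reduce their result differently.  The point is just that once everything
is unsigned, a single mask replaces the "if" block and "%" has one possible
answer.

```c
/* Hypothetical helper: reduce a hash value to a table slot. */
static unsigned long slot_for(unsigned long x, unsigned long nslots)
{
    x &= 0xffffffffUL;   /* cut back to 32 bits; no-op if long is 32 bits */
    return x % nslots;   /* both operands unsigned, so fully defined */
}
```

Passing (unsigned long)-1L gives the same slot whether long is 32 or 64 bits
wide, since the mask throws away whatever extra high bits there are.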

BTW, you can test stuff like this on Win32 by cloning the function and using
_int64 instead of long in the copy, then checking whether the two versions
return the same results.
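
The cloning test might look something like the sketch below.  Everything in
it is an assumption made for illustration: toy_hash_long is a made-up
stand-in and NOT the real f1, the multiplier is arbitrary, and "unsigned long
long" is used as a portable substitute for the Win32-specific 64-bit type.
It also uses unsigned arithmetic throughout so that overflow is well-defined.

```c
/* Toy hash, NOT the real f1: computed in unsigned long. */
static unsigned long toy_hash_long(const char *s)
{
    unsigned long x = 0;
    while (*s) {
        x = (x * 1000003UL) ^ (unsigned char)*s++;
        x &= 0xffffffffUL;          /* cut back to 32 bits */
    }
    return x;
}

/* The clone: identical except for the wider type. */
static unsigned long long toy_hash_wide(const char *s)
{
    unsigned long long x = 0;
    while (*s) {
        x = (x * 1000003ULL) ^ (unsigned char)*s++;
        x &= 0xffffffffULL;         /* cut back to 32 bits */
    }
    return x;
}

/* Returns 1 if both versions produce the same bits for s, else 0. */
static int clones_agree(const char *s)
{
    return (unsigned long long)toy_hash_long(s) == toy_hash_wide(s);
}
```

Run clones_agree over a pile of character names; any 0 coming back means the
function's result depends on the width of the type.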