[Python-Dev] new unicode hash calculation

M.-A. Lemburg mal@lemburg.com
Mon, 10 Jul 2000 19:30:20 +0200


Fredrik Lundh wrote:
> 
> mal wrote:
> 
> > * change hash value calculation to work on the Py_UNICODE data
> >   instead of creating a default encoded cached object (what
> >   now is .utf8str)
> 
> is this what you had in mind?
> 
> static long
> unicode_hash(PyUnicodeObject *self)
> {
>     register int len;
>     register Py_UNICODE *p;
>     register long x;
> 
>     if (self->hash != -1)
>         return self->hash;
>     len = PyUnicode_GET_SIZE(self);
>     p = PyUnicode_AS_UNICODE(self);
>     x = *p << 7;
>     while (--len >= 0)
>         x = (1000003*x) ^ *p++;
>     x ^= PyUnicode_GET_SIZE(self);
>     if (x == -1)
>         x = -2;
>     self->hash = x;
>     return x;
> }
> 
> </F>

Well, sort of. It should be done in such a way that Unicode
strings which only use the lower byte of each character produce
the same hash value as the equivalent 8-bit strings -- is this
the case for the above code?
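
One way to check this property outside the interpreter is a small
standalone sketch like the one below. It assumes the 8-bit string hash
uses the same loop as the quoted function (seed from the first element
shifted left by 7, multiply by 1000003, XOR in each element, then XOR in
the length). The names unit_t, hash_bytes, hash_units and
latin1_hashes_agree are hypothetical, unit_t is an assumed 16-bit
stand-in for Py_UNICODE, and unsigned arithmetic is used so that
overflow wraps instead of being undefined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t unit_t;   /* hypothetical stand-in for Py_UNICODE */

/* Assumed shape of the 8-bit string hash, applied to plain bytes. */
static unsigned long
hash_bytes(const unsigned char *s, size_t len)
{
    unsigned long x;
    size_t i;

    assert(s != NULL);
    /* The quoted code reads the first element even for an empty string
       (the terminator, which is 0); seed with 0 in that case. */
    x = (len > 0) ? (unsigned long)s[0] << 7 : 0;
    for (i = 0; i < len; i++)
        x = (1000003UL * x) ^ (unsigned long)s[i];
    x ^= (unsigned long)len;
    if (x == (unsigned long)-1)
        x = (unsigned long)-2;
    return x;
}

/* The same loop, applied to 16-bit units instead of bytes. */
static unsigned long
hash_units(const unit_t *s, size_t len)
{
    unsigned long x;
    size_t i;

    assert(s != NULL);
    x = (len > 0) ? (unsigned long)s[0] << 7 : 0;
    for (i = 0; i < len; i++)
        x = (1000003UL * x) ^ (unsigned long)s[i];
    x ^= (unsigned long)len;
    if (x == (unsigned long)-1)
        x = (unsigned long)-2;
    return x;
}

/* Widen an 8-bit string to 16-bit units and compare both hashes.
   Returns 1 if they agree, 0 otherwise. */
static int
latin1_hashes_agree(const char *text)
{
    unit_t wide[64];
    size_t len = strlen(text);
    size_t i;

    assert(len <= 64);
    for (i = 0; i < len; i++)
        wide[i] = (unit_t)(unsigned char)text[i];
    return hash_bytes((const unsigned char *)text, len)
           == hash_units(wide, len);
}
```

Under these assumptions every step of the two loops sees the same
integer values as long as no unit exceeds 255, so the two hashes agree
for such strings; they only diverge once a unit has a non-zero high
byte.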

My first idea was to apply a kind of two-pass scan which
uses only the lower byte of each character in the first pass
and only the higher byte in the second to calculate a hash
value. Both passes would use the same algorithm as the one
for 8-bit strings.
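
A rough sketch of one possible reading of the two-pass idea follows. It
is not a worked-out design: the way the two passes are combined (XOR of
the low-byte pass with the high-byte pass multiplied by 1000003) is
purely an assumption made for illustration, as are the names unit_t,
pass, two_pass_hash, byte_hash and two_pass_matches_bytes and the
16-bit width of unit_t. The 8-bit hash is assumed to have the same
shape as the loop quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t unit_t;   /* hypothetical stand-in for Py_UNICODE */

/* Core loop of the assumed 8-bit hash, run over one byte of each unit:
   shift = 0 selects the low byte, shift = 8 the high byte. */
static unsigned long
pass(const unit_t *s, size_t len, int shift)
{
    unsigned long x;
    size_t i;

    x = (len > 0) ? (unsigned long)((s[0] >> shift) & 0xFF) << 7 : 0;
    for (i = 0; i < len; i++)
        x = (1000003UL * x) ^ (unsigned long)((s[i] >> shift) & 0xFF);
    return x;
}

static unsigned long
two_pass_hash(const unit_t *s, size_t len)
{
    unsigned long low = pass(s, len, 0);
    unsigned long high = pass(s, len, 8);  /* 0 if all high bytes are 0 */
    unsigned long x;

    /* Assumed combining step: a zero high pass leaves the low pass
       untouched, so low-byte-only strings hash like 8-bit strings. */
    x = low ^ (1000003UL * high) ^ (unsigned long)len;
    if (x == (unsigned long)-1)
        x = (unsigned long)-2;
    return x;
}

/* Reference: the assumed 8-bit string hash on plain bytes. */
static unsigned long
byte_hash(const unsigned char *s, size_t len)
{
    unsigned long x;
    size_t i;

    x = (len > 0) ? (unsigned long)s[0] << 7 : 0;
    for (i = 0; i < len; i++)
        x = (1000003UL * x) ^ (unsigned long)s[i];
    x ^= (unsigned long)len;
    if (x == (unsigned long)-1)
        x = (unsigned long)-2;
    return x;
}

/* Returns 1 if the two-pass hash of the widened text equals the
   8-bit hash of the original bytes, 0 otherwise. */
static int
two_pass_matches_bytes(const char *text)
{
    unit_t wide[64];
    size_t len = strlen(text);
    size_t i;

    assert(len <= 64);
    for (i = 0; i < len; i++)
        wide[i] = (unit_t)(unsigned char)text[i];
    return two_pass_hash(wide, len)
           == byte_hash((const unsigned char *)text, len);
}
```

With this combining step, the second pass contributes nothing when all
high bytes are zero, which is what gives the compatibility with 8-bit
strings; the cost is a second scan over the data.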

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/