[Python-Dev] Re: test_unicode_file failing on Mac OS X

Thu Dec 11 09:27:59 EST 2003

I naïvely wrote:
 >Could we perhaps use a comparison that, in effect, did:
 >     def uni_equal(first, second):
 >         if first == second:
 >             return True
 >         return first.normalize() == second.normalize()
 >That is, take advantage of the fact that normalization is often
 >unnecessary for "trivial" reasons.
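
A runnable form of the quoted comparison might look like the sketch
below. It assumes a modern Python 3 str and the standard
unicodedata.normalize function. The choice of the NFC form and the
"form" parameter are assumptions for illustration; the quoted
pseudocode did not name a normalization form.

```python
import unicodedata

def uni_equal(first, second, form="NFC"):
    # Fast path: identical code point sequences need no normalization.
    if first == second:
        return True
    # Slow path: compare the canonical forms (NFC is an assumed choice).
    return (unicodedata.normalize(form, first)
            == unicodedata.normalize(form, second))

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
print(composed == decomposed)           # plain comparison
print(uni_equal(composed, decomposed))  # normalizing comparison
```

The fast path only helps when the two strings are already identical;
unequal strings always pay for two normalizations.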

This works, and a similar "unequal" trick may be constructible.
Ordering is certainly trickier: we must ensure we still have a total
order given the new equalities, so that we cannot choose a, b, and c
where:
     a < b = c < a   is  True.
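
To make the ordering hazard concrete, here is a small sketch, again
assuming Python 3 and NFC. The helper names uni_key and uni_less are
hypothetical. Raw code point comparison places "f" between the two
spellings of e-acute, which conflicts with treating those spellings
as equal; comparing normalized keys instead is one possible way to
keep the order consistent with the new equality.

```python
import unicodedata

def uni_key(text, form="NFC"):
    # Hypothetical helper: the normalized string serves as the sort key.
    return unicodedata.normalize(form, text)

def uni_less(first, second):
    # Order by normalized keys so that strings which are equal after
    # normalization can never be ordered apart from each other.
    return uni_key(first) < uni_key(second)

composed = "\u00e9"      # single code point
decomposed = "e\u0301"   # base letter plus combining accent

# Raw comparison: decomposed < "f" < composed, although the two
# spellings would compare equal under normalization.
print(decomposed < "f", "f" < composed)

# Keyed comparison: both spellings land on the same side of "f".
print(uni_less("f", composed), uni_less("f", decomposed))
print(sorted(["f", decomposed, "d", composed], key=uni_key))
```

The cost is that every ordered comparison normalizes both operands,
so there is no cheap fast path comparable to the equality case.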

But, Martin v. Löwis points out:
> It also affects hashing, if Unicode objects are used as dictionary
> keys. Objects that compare equal need to hash equal.
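
The hashing constraint can be illustrated with a wrapper type. This is
only a sketch: NormalizedKey is a hypothetical class, not anything in
the interpreter, and NFC is an assumed choice. It shows equality and
hash both derived from the same normalized form, so that keys which
compare equal also hash equal.

```python
import unicodedata

class NormalizedKey:
    """Hypothetical dictionary key comparing by normalized text."""

    def __init__(self, text):
        self.text = text
        # Cache the canonical form once; NFC is an assumed choice.
        self._canonical = unicodedata.normalize("NFC", text)

    def __eq__(self, other):
        if not isinstance(other, NormalizedKey):
            return NotImplemented
        return self._canonical == other._canonical

    def __hash__(self):
        # Hash the same value that __eq__ compares, so equal keys
        # are guaranteed to hash equal.
        return hash(self._canonical)

plain = {"\u00e9": "found"}
wrapped = {NormalizedKey("\u00e9"): "found"}

print("e\u0301" in plain)                     # plain str lookup
print(wrapped.get(NormalizedKey("e\u0301")))  # normalized lookup
```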

Still not disgusting, _but_ unicode strings must hash equal to
the corresponding "plain" string.  I am not certain about this
requirement for non-ASCII characters, but I expect we are stuck
with matching hashes in the range ord(' ') through ord('~') and
probably for all character values from 0 through 127.  We might
be able to classify UTF-16 code units into three groups:
   1) matches base ASCII character
   2) diacritical or combining
   3) definitely distinct from any ASCII or combining form.
If we map the group 1 entries to the corresponding ASCII code,
skip the group 2s, and take the group 3s separately (probably
remapping to another set), we might come up with a hash that
used only the map results as elements contributing to the hash.
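
One possible reading of the three-group idea is sketched below in
Python 3. Everything here is an assumption for illustration: the
names hash_elements and sketch_hash are hypothetical, the sketch works
on code points rather than UTF-16 code units, it uses NFD
decomposition to find the base ASCII character, it uses
unicodedata.combining to detect group 2, and it leaves group 3
characters unchanged rather than remapping them to another set.

```python
import unicodedata

def hash_elements(text):
    # Decompose first so that a precomposed letter yields its base
    # character followed by its combining marks.
    elements = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue          # group 2: combining mark, skipped
        elements.append(ch)   # group 1 (ASCII base) or group 3 (other)
    return "".join(elements)

def sketch_hash(text):
    # Only the mapped elements contribute to the hash. For pure ASCII
    # input the elements are the input itself, so the result matches
    # the hash of the plain string.
    return hash(hash_elements(text))

print(hash_elements("caf\u00e9"))
print(sketch_hash("caf\u00e9") == sketch_hash("cafe\u0301"))
print(sketch_hash("cafe") == hash("cafe"))
```

Note that under this mapping an accented string and its unaccented
base hash alike. That is only a hash collision, not an equality, so
it costs some dictionary performance but does not break correctness.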

Are we stuck with the current hash for unicode?  If so, there is
little hope.  If not, this might bear further investigation.

-Scott David Daniels
Scott.Daniels at Acm.Org