[Python-Dev] Hash collision security issue (now public)

Sun Jan 8 23:33:32 CET 2012

In http://mail.python.org/pipermail/python-dev/2012-January/115368.html
Stefan Behnel wrote:

> Admittedly, this may require some adaptation for the PEP393 unicode memory
> layout in order to produce identical hashes for all three representations
> if they represent the same content.

Distinct representations SHOULD NOT hold the same content; comparing two
strings currently requires converting them to canonical form, meaning the
smallest of those three formats that can hold the content.
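
As a rough illustration of what "smallest format that works" means, here
is a pure-Python sketch.  The helper name smallest_kind is made up for
this example and is not a CPython API; the thresholds follow the PEP 393
description of the 1-, 2- and 4-byte kinds.

```python
def smallest_kind(text: str) -> int:
    """Return bytes per code point (1, 2 or 4) of the narrowest kind.

    Hypothetical helper: it only models the PEP 393 rule, it does not
    inspect the real in-memory layout of `text`.
    """
    max_cp = max(map(ord, text), default=0)
    if max_cp < 0x100:
        return 1  # would fit PyUnicode_1BYTE_KIND
    if max_cp < 0x10000:
        return 2  # would need PyUnicode_2BYTE_KIND
    return 4      # would need PyUnicode_4BYTE_KIND


for sample in ("abc", "caf\u00e9", "\u20ac5", "\U0001F600"):
    print(repr(sample), smallest_kind(sample))
```

Under this rule every piece of content maps to exactly one kind, which is
why a hash over the canonical buffer needs no extra adaptation.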

If it can be represented in PyUnicode_1BYTE_KIND, then representations
using PyUnicode_2BYTE_KIND or PyUnicode_4BYTE_KIND don't count as
canonical, won't be created by Python itself, and already compare
unequal according to both PyUnicode_RichCompare and stringlib/eq.h (a
shortcut used by dicts).
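
A toy model of that comparison, in Python rather than C.  ToyStr and
toy_eq are invented for this sketch; the assumption is only that the real
equality check compares length and kind before comparing the raw buffers,
as described above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyStr:
    """Hypothetical stand-in for a PEP 393 string: a kind plus code points."""
    kind: int                     # bytes per code point: 1, 2 or 4
    code_points: Tuple[int, ...]

    def raw(self) -> bytes:
        # Lay each code point out in `kind` bytes, like the C buffer would.
        return b"".join(cp.to_bytes(self.kind, "little")
                        for cp in self.code_points)


def toy_eq(a: ToyStr, b: ToyStr) -> bool:
    """Equality that assumes both operands are already canonical."""
    if len(a.code_points) != len(b.code_points):
        return False
    if a.kind != b.kind:          # different kinds: unequal, no conversion
        return False
    return a.raw() == b.raw()     # memcmp-style comparison


canonical = ToyStr(1, (0x61, 0x62, 0x63))   # "abc" in the 1-byte kind
widened = ToyStr(2, (0x61, 0x62, 0x63))     # same content, non-canonical

print(toy_eq(canonical, canonical))  # True
print(toy_eq(canonical, widened))    # False
```

The second result is the point: the widened string holds the same code
points, yet the kind check rejects it before any content is examined.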

That said, I don't think the smallest-format rule is actually enforced by
anything stronger than comments (such as those on struct PyASCIIObject in
unicodeobject.h) and asserts (mostly calls to
_PyUnicode_CheckConsistency).  I don't have any insight into how
prevalent non-conforming strings will be in practice, or whether
supporting their equality will be required as a bugfix.
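
For a sense of what such an assert would have to verify, here is a
sketch of a canonical-form check.  The function is_canonical is
hypothetical and is not a transcription of _PyUnicode_CheckConsistency;
it only encodes the rule that the kind must be the narrowest one able to
hold the largest code point.

```python
from typing import Iterable


def is_canonical(kind: int, code_points: Iterable[int]) -> bool:
    """Return True if `kind` is the narrowest kind for these code points.

    Hypothetical check, modelling the invariant only.
    """
    max_cp = max(code_points, default=0)
    if kind == 1:
        return max_cp < 0x100
    if kind == 2:
        return 0x100 <= max_cp < 0x10000
    if kind == 4:
        return 0x10000 <= max_cp <= 0x10FFFF
    return False  # not one of the three kinds


print(is_canonical(1, [0x61, 0x62]))   # True
print(is_canonical(2, [0x61, 0x62]))   # False: 1-byte kind would do
print(is_canonical(2, [0x20AC]))       # True
```

A check like this costs a full scan of the string, which may be why it
lives in asserts rather than in release builds.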

-jJ