[I18n-sig] Unicode comparisons & normalization

Just van Rossum just@letterror.com
Wed, 3 May 2000 10:03:16 +0100


After quickly browsing through the unicode.org URLs I posted earlier, I
reach the following (possibly wrong) conclusions:

- there is a script- and language-independent canonical form (but automatic
normalization is indeed a bad idea)
- ideally, Unicode comparisons should follow the rules from
http://www.unicode.org/unicode/reports/tr10/ (but that seems hardly realistic
for 1.6, if at all...)
- this would indeed mean that it's possible for u == v even though type(u)
is type(v) and len(u) != len(v) (see the first sketch below). However, I
don't see how this would collapse /F's world, as the two strings are at most
semantically equivalent. Their physical difference is real, and still follows
the a-string-is-a-sequence-of-characters rule (!).
- there may be additional customized language-specific sorting rules. I
currently don't see how to implement those without some global variable (see
the second sketch below).
- the sorting rules are very complicated, and should be implemented by
calculating "sort keys". If I understood it correctly, these can take up to
4 bytes per character in their most compact form. Still, for sorting to be
somewhat speed-efficient, the keys need to be cached...
- u.find() may need an alternative API that returns a (begin, end) tuple,
since the match may not have the same length as the search string... (This
is tricky, since you need the begin and end indices in the non-canonical
form; see the last sketch below...)
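
First sketch: to make the "semantically equal but physically different" case
concrete, here is how it looks with the unicodedata.normalize() call that
later Python versions provide -- it was not available at the time of writing,
so take it purely as an illustration of the idea:

    import unicodedata

    u = "\u00c5"       # LATIN CAPITAL LETTER A WITH RING ABOVE, precomposed
    v = "A\u030a"      # "A" followed by COMBINING RING ABOVE, decomposed

    print(u == v)          # False: the code point sequences differ
    print(len(u), len(v))  # 1 2: the physical lengths differ too

    # Both strings are canonically equivalent: after normalizing to a
    # canonical form (NFC here, NFD works equally well) they compare equal.
    print(unicodedata.normalize("NFC", u) ==
          unicodedata.normalize("NFC", v))   # True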
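Second sketch: the standard locale module gives a rough approximation of the
sort-key idea (it is not a full TR10 implementation). strxfrm() turns a
string into a key that plain comparison orders correctly for the current
collation locale, and sorted()'s key= argument computes that key only once
per string, which is the caching mentioned above. The "de_DE.UTF-8" locale
name is an assumption about what is installed; setlocale() is exactly the
kind of global state mentioned above:

    import locale

    # Assumption: a German UTF-8 collation locale is installed here.
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")

    words = ["Zebra", "\u00c4pfel", "fu\u00dfen", "Fussel"]

    # strxfrm() builds one collation sort key per string; plain
    # sorted(words) would order by raw code point instead.
    print(sorted(words, key=locale.strxfrm))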
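Last sketch: a toy version of a (begin, end)-returning find(). The name
find_span() and the per-character normalization trick are my own assumptions;
a real implementation would also have to deal with combining marks that
reorder across character boundaries, which this version ignores:

    import unicodedata

    def find_span(haystack, needle):
        # Hypothetical helper: locate needle in haystack under canonical
        # equivalence and return (begin, end) indices into the original,
        # non-normalized haystack, or None if there is no match.
        norm_needle = unicodedata.normalize("NFD", needle)
        if not norm_needle:
            return (0, 0)
        # Normalize per character so normalized offsets can be mapped
        # back to offsets in the original string.
        chunks = [unicodedata.normalize("NFD", ch) for ch in haystack]
        norm_haystack = "".join(chunks)
        pos = norm_haystack.find(norm_needle)
        if pos < 0:
            return None
        # offsets[i] = index in haystack of the character that produced
        # normalized character i.
        offsets = []
        for i, chunk in enumerate(chunks):
            offsets.extend([i] * len(chunk))
        begin = offsets[pos]
        end = offsets[pos + len(norm_needle) - 1] + 1
        return (begin, end)

    # The decomposed search string "e" + COMBINING ACUTE matches a single
    # precomposed character of the original string.
    print(find_span("caf\u00e9 au lait", "e\u0301"))   # (3, 4)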

Just