Proposal: require 7-bit source str's

"Martin v. Löwis" martin at v.loewis.de
Sun Aug 22 17:40:24 EDT 2004


Hallvard B Furuseth wrote:
>>I agree with many things you said, but this example is bogus. If I
>>(as a German) use ns_4551-1, sorting is simple - and incorrect, because,
>>as you say, ö sorts with o in my language - yet the simple sorting of
>>ns_4551-1 doesn't. So sorting is *not* simple with ns_4551-1.

Ah, I missed the point that there is no ö in ns_4551-1. If so, then the
best way to represent the characters is to replace ö with "oe" and ä
with "ae"; replacing them merely with "o" and "a" would be considered
inadequate.
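
As a rough sketch of that replacement in modern Python 3 (the table and
the function name to_ascii_german are illustrative assumptions, not part
of any library; ü and ß are added here following the same convention):

```python
# Hypothetical helper: conventional German transliteration for 7-bit output.
GERMAN_TRANSLIT = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def to_ascii_german(text):
    """Replace umlauts and sharp s with their usual ASCII spellings."""
    return text.translate(GERMAN_TRANSLIT)

print(to_ascii_german("Löwis"))  # Loewis
```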

> And if I want to get both right, I need a sort_name field which is
> distinct from the display_name field.  There you would be lowis, while
> the Swede Törnquist would be tørnquist.  Or maybe lowis\tlöwis or
> something; a kind of private implementation of strxfrm().

But you can have a strxfrm for Unicode as well! There is nothing
inherent in Unicode that prevents using the same approach.
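
A minimal sketch of such a key function for Unicode strings, assuming
Python 3 and German-style rules where ö sorts with o. The name sort_key
is hypothetical; the accent stripping is a simplification and not a full
collation algorithm (for instance, ø has no decomposition and is left
unchanged):

```python
import unicodedata

def sort_key(name):
    """strxfrm-like key: accent-free form first, original as tie-breaker."""
    # NFD splits "ö" into "o" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", name.casefold())
    # Drop the combining marks so that "ö" compares like "o".
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, name)

names = ["Lowry", "Löwis", "Lower"]
print(sorted(names, key=sort_key))  # ['Lower', 'Löwis', 'Lowry']
```

Where a suitable locale is installed, locale.strxfrm from the standard
library can serve as the key function instead.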

Of course, the question always is what result you *want*: If you
have text that contains simultaneously Latin and Greek characters,
how would you like to collate it? Neither the German nor the Greek
collation rules are likely to help, as they don't consider the issue
of additional alphabets. If possible, you should assign a language
tag to each entry, and then sort first by language, then according
to the language's collation rules.
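
The language-tag approach could be sketched like this in Python 3. The
(language, text) entry layout, the COLLATION_KEYS table and the function
names are assumptions made for illustration; real per-language rules
would be more involved than stripping accents:

```python
import unicodedata

def strip_accents(text):
    """Simplified key: case-folded text without combining marks."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical table mapping a language tag to its key function.
COLLATION_KEYS = {
    "de": strip_accents,  # German: ö sorts with o
    "el": strip_accents,  # Greek: ignore the accent mark (tonos)
}

def entry_key(entry):
    """Sort first by language tag, then by that language's rules."""
    lang, text = entry
    key_func = COLLATION_KEYS.get(lang, str.casefold)
    return (lang, key_func(text), text)

entries = [("el", "Ωμέγα"), ("de", "Löwis"), ("el", "Άλφα"), ("de", "Lower")]
print(sorted(entries, key=entry_key))
```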

Regards,
Martin



More information about the Python-list mailing list