Proposal: require 7-bit source str's

Sun Aug 22 18:17:32 EDT 2004

Martin v. Löwis wrote:
>Hallvard B Furuseth wrote:
>>>I agree with many things you said, but this example is bogus. If I
>>>(as a German) use ns_4551-1, sorting is simple - and incorrect, because,
>>>as you say, ö sorts with o in my language - yet the simple sorting of
>>>ns_4551-1 doesn't. So sorting is *not* simple with ns_4551-1.
> 
> Ah, I missed the point that there is no ö in ns_4551-1. If so, then the
> best way to represent the characters is to replace ö with "oe" and ä
> with "ae"; replacing them merely with "o" and "a" would be considered
> inadequate.

Duh.  Of course.  We usually did that too when we had to write Norwegian
in ASCII.  It bites sometimes, though - like when it hits the common '1
character = 1 byte' assumption which someone -- John Roth? -- mentioned.
Maybe that's why we are getting to ø->o in e-mail addresses and such
things nowadays, to keep things simple.
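
To make the length problem concrete, here is a minimal sketch.  The
table and the to_ascii() helper are made up for illustration, they are
not from any library:

```python
# Hypothetical ASCII fallback table, for illustration only.
ASCII_FALLBACK = {"ö": "oe", "ä": "ae", "ø": "oe", "æ": "ae", "å": "aa"}

def to_ascii(text):
    """Replace each letter in the table with its two-letter ASCII spelling."""
    return "".join(ASCII_FALLBACK.get(ch, ch) for ch in text)

name = "Löwis"
ascii_name = to_ascii(name)

# The transliterated name is one character longer than the original, so
# code that assumes a fixed width per name, or one byte per character,
# gets a different answer after the replacement.
print(name, len(name))
print(ascii_name, len(ascii_name))
```

Mapping ø to plain "o" instead keeps the length unchanged, which is
presumably part of the attraction.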

In a way, it is rather nice to notice that I'm forgetting that stuff.
Maybe someday I won't even be able to read texts with {|} for æøå
without slowing down:-)

>> And if I want to get both right, I need a sort_name field which is
>> distinct from the display_name field.  There you would be lowis, while
>> the Swede Törnquist would be tørnquist.  Or maybe lowis\tlöwis or
>> something; a kind of private implementation of strxfrm().
> 
> But you can have a strxfrm for Unicode as well! There is nothing
> inherent in Unicode that prevents using the same approach.

Not after you have discarded the information which says whether to sort
ö as ø or o.
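
A rough sketch of what I mean, assuming each entry carries its own
sort_name next to the display_name.  The alphabet string, sort_key()
and the sample entries are only an illustration, not a real collation
implementation:

```python
# Norwegian alphabet order: æ, ø, å come after z.
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(NORWEGIAN_ALPHABET)}

def sort_key(sort_name):
    """A private strxfrm(): map each letter to its rank in the alphabet.

    Characters outside the alphabet sort after it, by code point.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in sort_name.lower()]

# (display_name, sort_name) pairs.  The sort_name records the decision
# about whether ö sorts as o or as ø for this particular person.
entries = [
    ("Törnquist", "tørnquist"),
    ("Löwis", "lowis"),
    ("Tvedt", "tvedt"),
    ("Lund", "lund"),
]

ordered = [display for display, sort_name in
           sorted(entries, key=lambda e: sort_key(e[1]))]
print(ordered)
```

With only the display_name available, sort_key() has no way to tell
that the ö in Löwis and the ö in Törnquist should be treated
differently.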

> Of course, the question always is what result you *want*: If you
> have text that contains simultaneously Latin and Greek characters,
> how would you like to collate it? Neither the German nor Greek
> collation rules are likely to help, as they don't consider the issue
> of additional alphabets.

True enough.  But when you mix entirely different scripts, you have
worse problems anyway; you'll often need to transliterate your name to
the local script - or to something close to English, I guess.  A written
name in a script the locals can't read isn't particularly useful.

> If possible, you should assign a language tag to each entry, and then
> sort first by language, then according to the language's collation
> rules.

That sounds very wrong for lists that are sorted for humans to search,
unless I misunderstand you.  That would place all Swedes after all
Norwegians in the phone book, for example.  And if you aren't sure of
the nationality of someone, you'd have to look through all foreign
languages that are present.
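
As I read the suggestion, it amounts to something like the sketch
below.  The names and language tags are invented, and plain string
comparison stands in for the per-language collation rules:

```python
# (name, language tag) pairs; invented sample data.
entries = [
    ("Berg", "no"),
    ("Vik", "no"),
    ("Andersson", "sv"),
    ("Lind", "sv"),
]

# Sort first by language, then by name within each language.
by_language = [name for name, lang in
               sorted(entries, key=lambda e: (e[1], e[0]))]

# Sort by name only, which is what someone searching the list expects.
by_name = [name for name, lang in sorted(entries, key=lambda e: e[0])]

print(by_language)
print(by_name)
```

In the first list Andersson ends up after Vik, so a reader who does not
know Andersson is tagged as Swedish has to scan both groups.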

-- 
Hallvard