[I18n-sig] possible bug in my UCA implementation

James Tauber jtauber at jtauber.com
Mon Jan 30 09:35:28 CET 2006


My Python Unicode Collation Algorithm implementation is giving  
unexpected results that could be because of:

1. a bug in my code
2. a bug in the DUCET
3. a difference of opinion between the way I think Ancient Greek
should be collated and the way the DUCET collates it

I'd like to get the opinion of those of you who are more familiar with
UCA (and who could perhaps try my example out on ICU).

For the purposes of testing, say I'm trying to sort the three words:

(1)	ᾅδης
(2)	Ἄβελ
(3)	ἀββά

In my view they should be sorted in the reverse of this order, but my
pyuca code sorts them in the order listed above.

pyuca assigns the words the following sort keys:

(1) ['0x124e', '0x0', '0x0', '0x0', '0x1252', '0x1257', '0x126a',  
'0x0', '0x20', '0x2a', '0x32', '0x97', '0x20', '0x20', '0x20', '0x0',  
'0x2', '0x2', '0x2', '0x2', '0x2', '0x2', '0x19', '0x0', '0x3b1',  
'0x314', '0x301', '0x345', '0x3b4', '0x3b7', '0x3c2']
(2) ['0x124e', '0x0', '0x0', '0x124f', '0x1253', '0x125c', '0x0',  
'0x20', '0x22', '0x32', '0x20', '0x20', '0x20', '0x0', '0x8', '0x2',  
'0x2', '0x2', '0x2', '0x2', '0x0', '0x391', '0x313', '0x301',  
'0x3b2', '0x3b5', '0x3bb']
(3) ['0x124e', '0x0', '0x124f', '0x124f', '0x124e', '0x0', '0x0',  
'0x20', '0x22', '0x20', '0x20', '0x20', '0x32', '0x0', '0x2', '0x2',  
'0x2', '0x2', '0x2', '0x2', '0x0', '0x3b1', '0x313', '0x3b2',  
'0x3b2', '0x3b1', '0x301']

The problem is that ᾅ (the first character of (1)) expands to 4
collation elements, Ἄ (the first character of (2)) to 3 and ἀ (the
first character of (3)) to 2. Because all but the first of those
elements have a primary weight of zero, a word compares less simply
by virtue of having more collation elements.
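
The effect can be reproduced with just the primary-level portion of the
three keys quoted above. This is only a sketch: the weights are copied
from those keys, and drop_zero_weights is a hypothetical helper, not
part of pyuca. It assumes that zero weights are meant to be skipped when
a sort key is formed, which is my reading of the sort key formation step
in UTS #10 and not something confirmed in this thread.

```python
# Primary-level portion of each sort key quoted above (the weights before
# the first level separator), written as integers rather than hex strings.
primary = {
    "ᾅδης": [0x124E, 0x0, 0x0, 0x0, 0x1252, 0x1257, 0x126A],
    "Ἄβελ": [0x124E, 0x0, 0x0, 0x124F, 0x1253, 0x125C],
    "ἀββά": [0x124E, 0x0, 0x124F, 0x124F, 0x124E, 0x0],
}


def drop_zero_weights(weights):
    """Hypothetical helper: keep only the nonzero weights of one level."""
    return [w for w in weights if w != 0]


# Python compares lists element by element, so a zero in some position
# sorts before any nonzero weight in the same position.
order_with_zeros = sorted(primary, key=lambda word: primary[word])

# Assumption: zero weights are skipped when the key is formed.
order_without_zeros = sorted(
    primary, key=lambda word: drop_zero_weights(primary[word])
)

print(order_with_zeros)
print(order_without_zeros)
```

With the zeros kept, the words come out in the order (1), (2), (3) as
reported; with the zeros dropped, the order is reversed.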

I don't even understand why these letters are being treated as  
expansions rather than simply taking advantage of the secondary and  
tertiary levels, but sure enough that is how the DUCET describes them.
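
For what it is worth, the expansion counts line up with the canonical
decompositions of the three initial letters, which can be checked with
the standard unicodedata module. This is a sketch only: the code points
U+1F85, U+1F0C and U+1F00 are inferred from the last level of the sort
keys above, and whether the DUCET entries are actually derived from
these decompositions is an assumption here.

```python
import unicodedata

# Initial letters of the three test words (code points inferred from the
# last level of the sort keys quoted above).
initials = ["\u1f85", "\u1f0c", "\u1f00"]

# Canonical decomposition (NFD) splits each precomposed letter into a
# base letter followed by its combining marks.
decomposed = [unicodedata.normalize("NFD", ch) for ch in initials]
lengths = [len(d) for d in decomposed]

for ch, d in zip(initials, decomposed):
    print("U+%04X" % ord(ch), ["0x%x" % ord(c) for c in d])
```

The decomposed lengths are 4, 3 and 2, matching the number of collation
elements each letter expands to.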

Am I missing something fundamental in the algorithm? Or is it  
possible the DUCET is wrong?

James

