[I18n-sig] possible bug in my UCA implementation

James Tauber jtauber at jtauber.com
Mon Feb 13 05:02:50 CET 2006


I discovered I wasn't converting to NFD first but this doesn't solve  
the problem, it just explains why expansions were used.

Even using NFD, I get the same result.

James

On 30/01/2006, at 4:35 PM, James Tauber wrote:

>
> My Python Unicode Collation Algorithm implementation is giving  
> unexpected results that could be because of:
>
> 1. a bug in my code
> 2. a bug in the DUCET
> 3. a difference of opinion between the way I think Ancient Greek  
> should be collated and the way DUCET thinks so
>
> I'd like to get the opinion of some of you who are more familiar  
> with UCA (and perhaps can try my example out on ICU)
>
> For the purposes of testing, say I'm trying to sort the three words:
>
> (1)	ᾅδης
> (2)	Ἄβελ
> (3)	ἀββά
>
> In my view they should be sorted in the reverse to what they are  
> now, but my pyuca code sorts them in the order listed above.
>
> pyuca assigns the words the following sort keys:
>
> (1) ['0x124e', '0x0', '0x0', '0x0', '0x1252', '0x1257', '0x126a',  
> '0x0', '0x20', '0x2a', '0x32', '0x97', '0x20', '0x20', '0x20',  
> '0x0', '0x2', '0x2', '0x2', '0x2', '0x2', '0x2', '0x19', '0x0',  
> '0x3b1', '0x314', '0x301', '0x345', '0x3b4', '0x3b7', '0x3c2']
> (2) ['0x124e', '0x0', '0x0', '0x124f', '0x1253', '0x125c', '0x0',  
> '0x20', '0x22', '0x32', '0x20', '0x20', '0x20', '0x0', '0x8',  
> '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x391', '0x313',  
> '0x301', '0x3b2', '0x3b5', '0x3bb']
> (3) ['0x124e', '0x0', '0x124f', '0x124f', '0x124e', '0x0', '0x0',  
> '0x20', '0x22', '0x20', '0x20', '0x20', '0x32', '0x0', '0x2',  
> '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x3b1', '0x313',  
> '0x3b2', '0x3b2', '0x3b1', '0x301']
>
> The problem is that ᾅ (the first character of (1)) expands to 4  
> collation elements, Ἄ (the first character of (2)) to 3 and ἀ  
> (the first character of (3)) to 2 and as a result and, because all  
> but the first element is zero, they are comparing less, just by  
> virtue of having more collation elements.
>
> I don't even understand why these letters are being treated as  
> expansions rather than simply taking advantage of the secondary and  
> tertiary levels, but sure enough that is how the DUCET describes them.
>
> Am I missing something fundamental in the algorithm? Or is it  
> possible the DUCET is wrong?
>
> James



More information about the I18n-sig mailing list