[Python-Dev] UTF-16 code point comparison

Finn Bock bckfnn@worldonline.dk
Fri, 28 Jul 2000 05:15:17 GMT


[M.-A. Lemburg]

>Finn Bock wrote:
>> 
>> [M.-A. Lemburg]
>> 
>> >BTW, does Java support UCS-4 ? If not, then Java is wrong
>> >here ;-)
>> 
>> Java claims to use Unicode 2.1 [*]. I couldn't locate anything describing
>> whether this is UCS-2 or UTF-16. I think Unicode 2.1 includes UCS-4. The
>> actual level of support for UCS-4 is probably debatable.
>> 
>> - The builtin char is 16 bits wide and obviously cannot support UCS-4.
>> - The Character class can report if a character is a surrogate:
>>     >>> from java.lang import Character
>>     >>> Character.getType("\ud800") == Character.SURROGATE
>>     1
>
>>>> unicodedata.category(u'\ud800')
>'Cs'
>
>... which means the same thing, only in Unicode 3 standard
>notation.
>
>Makes me think: perhaps we should add the Java constants to
>unicodedata base. Is there a list of those available
>somewhere ?

The character type constants defined in java.lang.Character are:

UNASSIGNED = 0
UPPERCASE_LETTER
LOWERCASE_LETTER
TITLECASE_LETTER
MODIFIER_LETTER
OTHER_LETTER
NON_SPACING_MARK
ENCLOSING_MARK
COMBINING_SPACING_MARK
DECIMAL_DIGIT_NUMBER
LETTER_NUMBER
OTHER_NUMBER
SPACE_SEPARATOR
LINE_SEPARATOR
PARAGRAPH_SEPARATOR
CONTROL
FORMAT
PRIVATE_USE
SURROGATE 
DASH_PUNCTUATION
START_PUNCTUATION
END_PUNCTUATION 
CONNECTOR_PUNCTUATION 
OTHER_PUNCTUATION 
MATH_SYMBOL 
CURRENCY_SYMBOL 
MODIFIER_SYMBOL 
OTHER_SYMBOL 
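
To make the correspondence with unicodedata concrete, here is a minimal
sketch that maps the Java constant names above to the two-letter general
category abbreviations returned by unicodedata.category(). The dictionary
and the helper java_type_name() are my own hypothetical names, not part of
either library. Categories that have no counterpart in the list above (for
example Pi and Pf) simply map to None.

```python
import unicodedata

# Java constant name -> Unicode general category abbreviation.
# The pairing follows the standard Unicode category names; the dictionary
# itself is illustrative and not part of unicodedata or java.lang.Character.
JAVA_TO_UNICODE_CATEGORY = {
    "UNASSIGNED": "Cn",
    "UPPERCASE_LETTER": "Lu",
    "LOWERCASE_LETTER": "Ll",
    "TITLECASE_LETTER": "Lt",
    "MODIFIER_LETTER": "Lm",
    "OTHER_LETTER": "Lo",
    "NON_SPACING_MARK": "Mn",
    "ENCLOSING_MARK": "Me",
    "COMBINING_SPACING_MARK": "Mc",
    "DECIMAL_DIGIT_NUMBER": "Nd",
    "LETTER_NUMBER": "Nl",
    "OTHER_NUMBER": "No",
    "SPACE_SEPARATOR": "Zs",
    "LINE_SEPARATOR": "Zl",
    "PARAGRAPH_SEPARATOR": "Zp",
    "CONTROL": "Cc",
    "FORMAT": "Cf",
    "PRIVATE_USE": "Co",
    "SURROGATE": "Cs",
    "DASH_PUNCTUATION": "Pd",
    "START_PUNCTUATION": "Ps",
    "END_PUNCTUATION": "Pe",
    "CONNECTOR_PUNCTUATION": "Pc",
    "OTHER_PUNCTUATION": "Po",
    "MATH_SYMBOL": "Sm",
    "CURRENCY_SYMBOL": "Sc",
    "MODIFIER_SYMBOL": "Sk",
    "OTHER_SYMBOL": "So",
}

# Reverse lookup: category abbreviation -> Java constant name.
UNICODE_CATEGORY_TO_JAVA = {
    abbrev: name for name, abbrev in JAVA_TO_UNICODE_CATEGORY.items()
}


def java_type_name(ch):
    """Return the Java-style constant name for a one-character string,
    or None if the category has no entry in the table above."""
    return UNICODE_CATEGORY_TO_JAVA.get(unicodedata.category(ch))
```

With this, java_type_name("\ud800") gives the name SURROGATE, matching the
Character.getType() check quoted earlier.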


>> - As reported, direct string comparison ignores surrogates.
>
>I would guess that this'll have to change as soon as JavaSoft
>folks realize that they need to handle UTF-16 and not only
>UCS-2.

Predicting the future can be difficult, but here is my take:
JavaSoft will never change the way String.compareTo works.
String.compareTo is documented as:
"""
  Compares two strings lexicographically. The comparison is based on 
  the Unicode value of each character in the strings. ...
"""

Instead they will describe String.compareTo as a very naive string comparison
and suggest that users use the Collator classes for anything but the simplest
cases.
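
For reference, the difference between the two orderings can be sketched in
a few lines. This is only an illustration, assuming a Python whose strings
are sequences of code points; the function names are hypothetical. Comparing
16-bit code units (what a comparison "based on the Unicode value of each
character" does when a char is 16 bits) puts every surrogate pair before
U+E000..U+FFFF, while comparing code points puts it after.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def cmp_code_units(a, b):
    """Compare by 16-bit code unit; returns -1, 0 or 1."""
    ua, ub = utf16_code_units(a), utf16_code_units(b)
    return (ua > ub) - (ua < ub)


def cmp_code_points(a, b):
    """Compare by code point; returns -1, 0 or 1."""
    pa, pb = [ord(c) for c in a], [ord(c) for c in b]
    return (pa > pb) - (pa < pb)


bmp_char = "\ue000"         # one code unit: 0xE000
astral_char = "\U00010000"  # two code units: 0xD800 0xDC00

# The two orderings disagree for this pair.
print(cmp_code_units(bmp_char, astral_char))
print(cmp_code_points(bmp_char, astral_char))
```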


>> - The BreakIterator does not handle surrogates. It does handle
>>   combining characters and it seems a natural place to put support
>>   for surrogates.
>
>What is a BreakIterator ? An iterator to scan line/word breaks ?

Yes, as well as character breaks. It already contains the framework for
treating two adjacent chars as a single character.
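
As a rough sketch of what surrogate support in a character-break iterator
could look like (this is not how BreakIterator is implemented, and the names
are hypothetical): walk a sequence of 16-bit code units and report a boundary
after each unit, except between a high surrogate and the low surrogate that
follows it.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def character_breaks(units):
    """Return the boundary offsets in a list of UTF-16 code units.

    A well-formed surrogate pair counts as one character; a lone
    surrogate is treated as a character of its own.
    """
    breaks = [0]
    i = 0
    while i < len(units):
        if (is_high_surrogate(units[i])
                and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            i += 2  # keep the pair together
        else:
            i += 1
        breaks.append(i)
    return breaks


# 'A', U+10000 as a surrogate pair, 'B'
print(character_breaks([0x0041, 0xD800, 0xDC00, 0x0042]))
```

Combining characters would need the same kind of lookahead, which is why
the existing framework seems like a natural home for this.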

>> - The Collator class offers different levels of normalization before
>>   comparing strings but does not seem to support surrogates. This class
>>   seems a natural place for JavaSoft to put support for surrogates
>>   during string comparison.
>
>We'll need something like this for 2.1 too: first some
>standard APIs for normalization and then a few unicmp()
>APIs to use for sorting.
>
>We might even have to add collation sequences somewhere because
>this is a locale issue as well... sometimes it's even worse
>with different strategies used for different tasks within one
>locale, e.g. in Germany we sometimes sort the Umlaut ä as "ae"
>and at other times as extra character.

Info: The Java Collator class is configured with
- a locale and
- a strength parameter
   IDENTICAL: all differences are significant.
   PRIMARY (a vs b)
   SECONDARY (a vs ä)
   TERTIARY (a vs A)
- a decomposition (http://www.unicode.org/unicode/reports/tr15/)
   NO_DECOMPOSITION
   CANONICAL_DECOMPOSITION
   FULL_DECOMPOSITION
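
To give a feel for the strength levels, here is a very rough Python sketch.
It is not the Java algorithm and it ignores the locale completely; the
constants and collation_key() are hypothetical names. It assumes a
unicodedata module that offers normalize() and combining(), and it always
applies canonical decomposition (NFD) first.

```python
import unicodedata

PRIMARY, SECONDARY, TERTIARY = 0, 1, 2


def strip_marks(s):
    """Drop combining marks from an already decomposed string."""
    return "".join(c for c in s if not unicodedata.combining(c))


def collation_key(s, strength):
    """Return a comparison key that ignores differences below `strength`.

    PRIMARY   keeps base letters only      (a vs b differ)
    SECONDARY also keeps accents           (a vs a-umlaut differ)
    TERTIARY  also keeps case              (a vs A differ)
    """
    decomposed = unicodedata.normalize("NFD", s)
    if strength == PRIMARY:
        return strip_marks(decomposed).casefold()
    if strength == SECONDARY:
        return decomposed.casefold()
    return decomposed


words = ["b", "\u00e4", "A", "a"]
# At PRIMARY strength 'a', 'A' and a-umlaut all share one key.
print(sorted(set(collation_key(w, PRIMARY) for w in words)))
```

A real collator would build multi-level sort keys from locale-specific
tables instead of comparing the decomposed strings directly.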

regards,
finn