[Python-Dev] UTF-16 code point comparison
Finn Bock
bckfnn@worldonline.dk
Fri, 28 Jul 2000 05:15:17 GMT
[M.-A. Lemburg]
>Finn Bock wrote:
>>
>> [M.-A. Lemburg]
>>
>> >BTW, does Java support UCS-4 ? If not, then Java is wrong
>> >here ;-)
>>
>> Java claims to use Unicode 2.1 [*]. I couldn't locate anything describing
>> whether this is UCS-2 or UTF-16. I think Unicode 2.1 includes UCS-4. The
>> actual level of support for UCS-4 is probably debatable.
>>
>> - The builtin char is 16bit wide and can obviously not support UCS-4.
>> - The Character class can report if a character is a surrogate:
>> >>> from java.lang import Character
>> >>> Character.getType("\ud800") == Character.SURROGATE
>> 1
>
> >>> unicodedata.category(u'\ud800')
>'Cs'
>
>... which means the same thing only in Unicode3 standards
>notation.
>
>Makes me think: perhaps we should add the Java constants to
>unicodedata base. Is there a list of those available
>somewhere ?
UNASSIGNED = 0
UPPERCASE_LETTER
LOWERCASE_LETTER
TITLECASE_LETTER
MODIFIER_LETTER
OTHER_LETTER
NON_SPACING_MARK
ENCLOSING_MARK
COMBINING_SPACING_MARK
DECIMAL_DIGIT_NUMBER
LETTER_NUMBER
OTHER_NUMBER
SPACE_SEPARATOR
LINE_SEPARATOR
PARAGRAPH_SEPARATOR
CONTROL
FORMAT
PRIVATE_USE
SURROGATE
DASH_PUNCTUATION
START_PUNCTUATION
END_PUNCTUATION
CONNECTOR_PUNCTUATION
OTHER_PUNCTUATION
MATH_SYMBOL
CURRENCY_SYMBOL
MODIFIER_SYMBOL
OTHER_SYMBOL
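
For comparison, here is a rough sketch of how these names line up with the two-letter categories that unicodedata.category() returns. The JAVA_TO_UNICODE table and the java_type_name() helper are illustrative names of my own, not part of unicodedata or of Java, and the sketch assumes a Python 3 interpreter.

```python
import unicodedata

# Java Character type constant name -> Unicode general category abbreviation.
# Only the names from the list above are covered.
JAVA_TO_UNICODE = {
    "UNASSIGNED": "Cn",
    "UPPERCASE_LETTER": "Lu",
    "LOWERCASE_LETTER": "Ll",
    "TITLECASE_LETTER": "Lt",
    "MODIFIER_LETTER": "Lm",
    "OTHER_LETTER": "Lo",
    "NON_SPACING_MARK": "Mn",
    "ENCLOSING_MARK": "Me",
    "COMBINING_SPACING_MARK": "Mc",
    "DECIMAL_DIGIT_NUMBER": "Nd",
    "LETTER_NUMBER": "Nl",
    "OTHER_NUMBER": "No",
    "SPACE_SEPARATOR": "Zs",
    "LINE_SEPARATOR": "Zl",
    "PARAGRAPH_SEPARATOR": "Zp",
    "CONTROL": "Cc",
    "FORMAT": "Cf",
    "PRIVATE_USE": "Co",
    "SURROGATE": "Cs",
    "DASH_PUNCTUATION": "Pd",
    "START_PUNCTUATION": "Ps",
    "END_PUNCTUATION": "Pe",
    "CONNECTOR_PUNCTUATION": "Pc",
    "OTHER_PUNCTUATION": "Po",
    "MATH_SYMBOL": "Sm",
    "CURRENCY_SYMBOL": "Sc",
    "MODIFIER_SYMBOL": "Sk",
    "OTHER_SYMBOL": "So",
}

# Reverse lookup: category abbreviation -> Java-style name.
UNICODE_TO_JAVA = {abbr: name for name, abbr in JAVA_TO_UNICODE.items()}


def java_type_name(ch):
    """Return the Java-style name for one character, or None when the
    category (e.g. Pi or Pf) has no counterpart in the list above."""
    return UNICODE_TO_JAVA.get(unicodedata.category(ch))


print(java_type_name("\ud800"))  # SURROGATE
print(java_type_name("A"))       # UPPERCASE_LETTER
```
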
>> - As reported, direct string comparison ignores surrogates.
>
>I would guess that this'll have to change as soon as JavaSoft
>folks realize that they need to handle UTF-16 and not only
>UCS-2.
Predicting the future can be difficult, but here is my take:
JavaSoft will never change the way String.compareTo works.
String.compareTo is documented as:
"""
Compares two strings lexicographically. The comparison is based on
the Unicode value of each character in the strings. ...
"""
Instead they will mark it as a very naive string comparison and suggest
that users use the Collator classes for anything but the simplest cases.
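
To make the difference concrete, here is a small sketch in Python 3 (not Java) of what comparing by 16-bit unit value means once surrogates are involved. The utf16_units() helper is a name I made up for illustration. It encodes a string to UTF-16 and compares the resulting code units, which I assume is the behaviour the compareTo documentation describes.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    raw = s.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


a = "\uff5e"      # FULLWIDTH TILDE, a single code unit 0xFF5E
b = "\U00010000"  # first code point beyond the BMP: 0xD800 0xDC00

# Code unit order: 0xFF5E > 0xD800, so a sorts after b.
by_code_unit = utf16_units(a) < utf16_units(b)

# Code point order: 0xFF5E < 0x10000, so a sorts before b.
by_code_point = [ord(c) for c in a] < [ord(c) for c in b]

print(by_code_unit, by_code_point)  # False True
```

The two orders disagree only when one side has a character in the range U+E000..U+FFFF and the other a surrogate pair at the same position.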
>> - The BreakIterator does not handle surrogates. It does handle
>> combining characters and it seems a natural place to put support
>> for surrogates.
>
>What is a BreakIterator ? An iterator to scan line/word breaks ?
Yes, as well as character breaks. It already contains the framework for
marking two chars next to each other as one.
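
As an illustration of "marking two chars next to each other as one", here is a minimal Python sketch that groups UTF-16 code units so that a high surrogate followed by a low surrogate counts as a single unit. This is not the BreakIterator algorithm, and group_code_units() is a hypothetical helper of mine; it only shows the kind of pairing that would be needed.

```python
def group_code_units(units):
    """Group 16-bit code units; a high surrogate immediately followed
    by a low surrogate becomes one two-element group."""
    groups = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            groups.append((unit, units[i + 1]))
            i += 2
        else:
            # Ordinary unit, or an unpaired surrogate left on its own.
            groups.append((unit,))
            i += 1
    return groups


# 'a', one surrogate pair, 'b': three "characters" from four code units.
print(len(group_code_units([0x61, 0xD800, 0xDC00, 0x62])))  # 3
```
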
>> - The Collator class offers different levels of normalization before
>> comparing strings but does not seem to support surrogates. This class
>> seems a natural place for JavaSoft to put support for surrogates
>> during string comparison.
>
>We'll need something like this for 2.1 too: first some
>standard APIs for normalization and then a few unicmp()
>APIs to use for sorting.
>
>We might even have to add collation sequences somewhere because
>this is a locale issue as well... sometimes it's even worse
>with different strategies used for different tasks within one
>locale, e.g. in Germany we sometimes sort the Umlaut ä as "ae"
>and at other times as extra character.
Info: The Java Collator class is configured with
- a locale,
- a strength parameter:
    IDENTICAL (all differences are significant)
    PRIMARY (a vs b)
    SECONDARY (a vs ä)
    TERTIARY (a vs A)
- and a decomposition (http://www.unicode.org/unicode/reports/tr15/):
    NO_DECOMPOSITION
    CANONICAL_DECOMPOSITION
    FULL_DECOMPOSITION
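
The strength levels can be imitated very roughly in Python 3 with unicodedata.normalize(). This is only a sketch under my own assumptions, not how the Java Collator works internally: crude_key() and its strength strings are made-up names, and real collation needs locale-specific rules that this ignores. NFD stands in for the canonical decomposition here.

```python
import unicodedata


def crude_key(s, strength):
    """Build a rough comparison key for s at the given strength.

    PRIMARY   ignores accents and case (a == ä == A, a != b)
    SECONDARY ignores case only        (a == A, a != ä)
    TERTIARY  keeps accents and case   (a != A)
    """
    # Canonical decomposition: 'ä' becomes 'a' + U+0308.
    decomposed = unicodedata.normalize("NFD", s)
    if strength == "PRIMARY":
        base = [c for c in decomposed if not unicodedata.combining(c)]
        return "".join(base).lower()
    if strength == "SECONDARY":
        return decomposed.lower()
    if strength == "TERTIARY":
        return decomposed
    raise ValueError("unknown strength: %r" % (strength,))


print(crude_key("ä", "PRIMARY") == crude_key("a", "PRIMARY"))      # True
print(crude_key("ä", "SECONDARY") == crude_key("a", "SECONDARY"))  # False
print(crude_key("A", "SECONDARY") == crude_key("a", "SECONDARY"))  # True
print(crude_key("A", "TERTIARY") == crude_key("a", "TERTIARY"))    # False
```
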
regards,
finn