convert Unicode to lower/uppercase?

jallan jallan at smrtytrek.com
Tue Sep 23 12:18:02 EDT 2003


martin at v.loewis.de (Martin v. Löwis) wrote in message news:<m3smmo5zx6.fsf at mira.informatik.hu-berlin.de>...
> Peter Otten <__peter__ at web.de> writes:
> 
> > # The German es-zed is special--the normal mapping is to SS.
> > # Note: the titlecase should never occur in practice. It is equal to
> > titlecase(uppercase(<es-zed>))
> > 
> > 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
> > [...]
> > 
> > Thus, to comply with the standard, "ß".upper() --> "SS" is required.
> 
> No. It would be required if .upper would claim to implement
> SpecialCasing - but it makes no such claim.

Of course not. From http://www.python.org/doc/current/lib/string-methods.html#l2h-203:

<<
*upper()*
    Return a copy of the string converted to uppercase.
>>

This makes no claim about how the magic is done. But there is
certainly an implied claim that it is done correctly.

Unicode specifications are easily available at
http://www.unicode.org/versions/Unicode4.0.0/.

Section 3.13 states:

<< The full case mappings for Unicode characters are obtained by using
the mappings from SpecialCasing.txt _plus_ the mappings from
UnicodeData.txt, excluding any latter mappings that would conflict. >>

Case mappings for Unicode require the use of SpecialCasing.txt;
otherwise the results are not in accord with the Unicode standard.
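
As a rough illustration of that rule, here is a minimal sketch of a
full uppercase mapping: look in the special mappings first and fall
back to the 1:1 mapping otherwise. The names SPECIAL_UPPER,
simple_upper and full_upper are my own, the table holds only three
entries copied by hand from SpecialCasing.txt, and simple_upper only
approximates the UnicodeData.txt mapping. It is not how any Python
release implements upper().

```python
# Hypothetical sketch: hand-copied table, illustrative names only.

# Unconditional 1:many uppercase mappings taken from SpecialCasing.txt.
SPECIAL_UPPER = {
    "\u00df": "\u0053\u0053",  # LATIN SMALL LETTER SHARP S -> "SS"
    "\u0149": "\u02bc\u004e",  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    "\ufb02": "\u0046\u004c",  # LATIN SMALL LIGATURE FL -> "FL"
}


def simple_upper(ch):
    """Approximate the 1:1 mapping: never return more than one character."""
    up = ch.upper()
    # If the interpreter expands the character, keep the original instead,
    # which is what a table limited to single characters would do here.
    return up if len(up) == 1 else ch


def full_upper(text):
    """Special mappings first, then the 1:1 mapping for everything else."""
    return "".join(SPECIAL_UPPER.get(ch, simple_upper(ch)) for ch in text)


if __name__ == "__main__":
    word = "stra\u00dfe"
    print(full_upper(word))                          # STRASSE
    print(len(word), len(full_upper(word)))          # 6 7
```

The result is one character longer than the input, which is exactly
the case a 1:1 table cannot express.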

Section 4.2 says:

<< Only legacy implementations that cannot handle case mappings that
increase string lengths should use UnicodeData case mappings alone.
The single-character mappings are insufficient for languages such as
German >>

I don't see any particular reason why Python "cannot handle case
mappings that increase string lengths".

Unicode again warns that using UnicodeData.txt alone is not
sufficient.

The text continues on "SpecialCasing.txt":

<< Contains additional case mappings that map to more than one
character, such as "ß" to "SS". >>
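
For anyone who wants to read those mappings directly, the sharp s
entry quoted earlier in this thread can be parsed with a few lines.
This is only a sketch: parse_special_casing_line is a name I made up,
and it assumes the documented field order of the file (code, lower,
title, upper, an optional condition list, then a # comment).

```python
# Hypothetical helper for one data line of SpecialCasing.txt.

def parse_special_casing_line(line):
    """Split one entry into its code point and its three case mappings."""
    data = line.split("#", 1)[0]                  # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    code, lower, title, upper = fields[:4]
    # A fifth non-empty field, when present, lists language/context conditions.
    conditions = fields[4] if len(fields) > 4 else ""

    def to_text(hex_codes):
        # "0053 0053" -> "SS"
        return "".join(chr(int(h, 16)) for h in hex_codes.split())

    return {
        "char": to_text(code),
        "lower": to_text(lower),
        "title": to_text(title),
        "upper": to_text(upper),
        "conditions": conditions,
    }


if __name__ == "__main__":
    entry = parse_special_casing_line(
        "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
    )
    print(entry["upper"], entry["title"])  # SS Ss
```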

Section 5.18, Case Mappings, goes into further detail about casing
issues and specifically mentions:

<< Case mappings may produce strings of different length than the
original. For example the German character U+00DF ß LATIN SMALL LETTER
SHARP S expands when uppercased to the sequence of two characters "SS".
This also occurs where there is no precomposed character corresponding
to a case mapping, such as with U+0149 ŉ LATIN SMALL LETTER N
PRECEDED BY APOSTROPHE. >>

See also http://www.unicode.org/faq/casemap_charprop-old.html for the
Unicode FAQ which contains:

<<
Q: Why is there no upper-case SHARP S (ß)?

A: There are 139 lower-case letters in Unicode 2.1 that have no direct
uppercase equivalent. Should there be introduced new bogus characters
for all of them, so that when you see an "fl" ligature you can
uppercase it to "FL" without expanding anything? Of course not.

Note that case conversion is inherently language-sensitive, notably in
the case of IPA, which needs to be left strictly alone even when
embedded in another language which is being case converted. The best
you can get is an approximate fit. [JC]

Q: Is all of the Unicode case mapping information in UnicodeData.txt?

A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
but doesn't include 1:many mappings such as the one needed for
uppercasing ß. Since many parsers now expect this file to have at most
single characters in the case mapping fields, an additional file
(SpecialCasing.txt) was added to provide the 1:many mappings. For more
information, see UTR #21: Case Mappings. [MD]
>>

Python specifications make an implied claim of full support for
Unicode and an implied claim that the function upper() uppercases a
string properly.

The implied combined claim is that Python supports Unicode and
supports proper casing in Unicode.

This implied claim is false.

Truly accurate documentation for upper() should say that it uppercases
a string except for those characters whose uppercase form expands to
more than one character; such a character is either left unchanged or
uppercased with loss of data.

Python specifications need not say how casing is done, whether by
using Unicode tables directly or by using its own methods that
accomplish the same results.

Users should not have to know such details. They may wish to know
where a particular function does not do what might be expected of it.
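
A user who does want to know can at least check a given string. The
sketch below lists the lowercase letters that upper() hands back
untouched. The function name unchanged_lowercase is my own, and what
it reports for a character such as ß depends on how the interpreter
in use implements upper(), so treat the output as a diagnostic and
not as a statement about the standard.

```python
import unicodedata


def unchanged_lowercase(text):
    """Return the lowercase letters in text that upper() leaves as they are."""
    return [
        ch
        for ch in text
        # Category "Ll" is Letter, lowercase in the Unicode character database.
        if unicodedata.category(ch) == "Ll" and ch.upper() == ch
    ]


if __name__ == "__main__":
    # U+0138 LATIN SMALL LETTER KRA has no uppercase mapping at all.
    for ch in unchanged_lowercase("stra\u00dfe a\u0138"):
        print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))
```

Some letters show up here legitimately because they have no uppercase
form; the interesting ones are those that the standard does map.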

Jim Allan



