convert Unicode to lower/uppercase?

jallan jallan at smrtytrek.com
Mon Sep 22 11:24:50 EDT 2003


Peter Otten <__peter__ at web.de> wrote in message news:<bkm919$ast$01$1 at news.t-online.com>...
> "Martin v. Löwis" wrote:
> 
> > jallan wrote:
> > 
> >> But that really doesn't work properly. According to Unicode specs and
> >> German usage the uppercase of "ß" is actually "SS", that is the single
> >> character "ß" should uppercase to two characters.
> > 
> > Can you cite exact chapter and verse of the Unicode specs that say so?
> > According to the Unicode database,
> > 
> > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> > 
> > "ß" has neither an uppercase mapping, nor a lowercase mapping.
> 
> It seems like UnicodeData.txt does not give the full story. Quoting from
> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt:
> 
> [...]

> # (For compatibility, the UnicodeData.txt file only contains case mappings for
> # characters where they are 1-1, and does not have locale-specific mappings.)
> [...]
> # <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
> [...]
> # The German es-zed is special--the normal mapping is to SS.
> # Note: the titlecase should never occur in practice. It is equal to
> # titlecase(uppercase(<es-zed>))
> 
> 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
> [...]
> 
> Thus, to comply with the standard, "ß".upper() --> "SS" is required.

Yes.
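
To make the quoted SpecialCasing.txt entry concrete, here is a rough
sketch of how its fields decode. It assumes Python 3 syntax and handles
only unconditional entries; the function name parse_special_casing is my
own invention, not part of any library or of the Unicode data files.

```python
def parse_special_casing(line):
    """Decode one unconditional SpecialCasing.txt entry into strings."""
    # Drop the trailing "# comment", then split the ';'-separated fields.
    data = line.split("#", 1)[0]
    fields = [field.strip() for field in data.split(";")]
    code, lower, title, upper = fields[:4]

    def decode(hex_codes):
        # Each field is a space-separated list of hexadecimal code points.
        return "".join(chr(int(cp, 16)) for cp in hex_codes.split())

    return {
        "char": decode(code),
        "lower": decode(lower),
        "title": decode(title),
        "upper": decode(upper),
    }


entry = parse_special_casing(
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
)
print(entry["upper"])  # SS
```

The upper field holds two code points (0053 0053), which is exactly the
one-to-two mapping under discussion.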

Also the Unicode main charts in the annotation for 00DF state:

    uppercase is "SS"

See http://www.unicode.org/charts/PDF/U0080.pdf

This note on the character first appeared in Unicode 1.0 (published in
1991) and has been in every revision.

Unicode 1.0, Volume One also lists this in the lower case to upper
case casing tables on page 453.

There is nothing new about this casing requirement.

A further mention occurs in the Unicode 4.0 specifications in Table
4-1 in section 4.2 Case--Normative.  See
http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf

This contains the warning:

<< Only legacy implementations that cannot handle case mappings that
increase string lengths should use UnicodeData case mappings alone. The
single-character mappings are insufficient for languages such as
German. >>
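
The difference the warning describes can be sketched as follows. This is
only an illustration under stated assumptions (Python 3 syntax): the
helper one_to_one_upper stands in for an implementation restricted to
1-1 mappings, and SPECIAL_UPPER is a hand-written one-entry table, not
the full SpecialCasing.txt data. All names here are hypothetical.

```python
def one_to_one_upper(ch):
    # Stand-in for UnicodeData.txt-only casing: accept a mapping only
    # when it yields exactly one character, otherwise leave ch unchanged.
    mapped = ch.upper()
    return mapped if len(mapped) == 1 else ch


# Hand-written excerpt of the special uppercase mappings (assumption:
# only the sharp s entry quoted earlier in this thread).
SPECIAL_UPPER = {"\u00df": "SS"}


def upper_full(text, special_upper, simple_upper):
    # Prefer the special (possibly multi-character) mapping, and fall
    # back to the single-character mapping for everything else.
    return "".join(special_upper.get(ch, simple_upper(ch)) for ch in text)


word = "stra\u00dfe"
legacy = "".join(one_to_one_upper(ch) for ch in word)
full = upper_full(word, SPECIAL_UPPER, one_to_one_upper)

print(legacy, len(legacy))  # sharp s left untouched, length stays 6
print(full, len(full))      # STRASSE, length grows to 7
```

The result is one character longer than the input, which is why code
that assumes uppercasing preserves string length cannot apply the full
mappings.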

So is Python just another shit legacy implementation?

Jim Allan



