convert Unicode to lower/uppercase?

jallan jallan at smrtytrek.com
Fri Sep 26 17:01:31 EDT 2003


"Neil Hodgson" <nhodgson at bigpond.net.au> wrote in message news:<wmJcb.122907$bo1.8337 at news-server.bigpond.net.au>...
> Me:
> 
> > for an illustrative but incorrect example
> > "ca~non".upper() -> {'portugal':'CANON','spain':'CA~NON'}),
> 
>    For a real example from the Microsoft web site, uppercasing "indigo"
> (u'\u0069\u006e\u0064\u0069\u0067\u006f') gives "INDIGO"
> (u'\u0049\u004e\u0044\u0049\u0047\u004f') for English-US and similar but
> with dots above the 'I's for Turkish:
> (u'\u0130\u004e\u0044\u0130\u0047\u004f').
> 

The file http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
purportedly contains *all* casings for all scripts for all languages
where the casings are not one-to-one or are otherwise not
straightforward.
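
To make the file's layout concrete, here is a minimal sketch of reading it in Python 3. Each data line has the form "code; lower; title; upper; (conditions;) # comment". The function name parse_special_casing and the dictionary keys are my own invention, and the three sample lines are only illustrative of the format; check them against the real file before relying on them.

```python
# Illustrative sample lines in the SpecialCasing.txt layout (assumed, not
# copied from any particular version of the file).
SAMPLE = [
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S",
    "0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE",
    "0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I",
]

def _to_str(field):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())

def parse_special_casing(lines):
    """Yield one dict per data line; comment-only and blank lines are skipped."""
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop the trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        code, lower, title, upper = fields[:4]
        # An optional fifth field holds locale IDs and/or context conditions.
        conditions = fields[4].split() if len(fields) > 4 and fields[4] else []
        yield {
            "char": _to_str(code),
            "lower": _to_str(lower),
            "title": _to_str(title),
            "upper": _to_str(upper),
            "conditions": conditions,
        }

entries = list(parse_special_casing(SAMPLE))
# Entries with a non-empty condition list are the locale- or context-sensitive ones.
locale_sensitive = [e for e in entries if e["conditions"]]
```

Filtering on the condition field is how one would check which entries are locale-dependent.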

The *only* locale oddities there are for Lithuanian and for the two
languages Turkish and Azeri, and they concern only dot/no-dot variants
of the letters _i_, _I_, _j_, _J_ (plus, for Lithuanian, forms of _i_
with ogonek or an accent) and no others.

There are *no* other locale-based oddities. The mess is thankfully
*very* limited in scope.

In my opinion, if the full Unicode casing specification is to be
followed, the most useful solution would be a parameter allowing the
user to choose among (1) normal Latin casing, (2) Turkish/Azeri or (3)
Lithuanian as the casing model for treatment of these letters.

The default for the parameter would either be based on current locale
or be normal Latin casing. I think the latter far better as it is
dangerous to have functions in a language differ from machine to
machine according to the current locale.
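
As a rough sketch of what such a parameter could look like in Python 3: the names upper, lower and model are hypothetical, not an existing API, and the default is normal Latin casing regardless of locale. The Lithuanian branch handles only the simplest case (a combining dot directly after _i_ or _j_), and Lithuanian lowercasing is left out because it needs the context rules from SpecialCasing.txt.

```python
DOT_ABOVE = "\u0307"          # COMBINING DOT ABOVE
MODELS = ("latin", "turkic", "lithuanian")

def upper(text, model="latin"):
    """Uppercase text under an explicitly chosen casing model (hypothetical API)."""
    if model not in MODELS:
        raise ValueError("unknown casing model: %r" % (model,))
    if model == "turkic":
        # Turkish/Azeri: dotted i uppercases to U+0130 (capital I with dot above).
        # Dotless i (U+0131) already uppercases to plain I under the default rules.
        text = text.replace("i", "\u0130")
    elif model == "lithuanian":
        # Simplification: a dot kept explicitly after i or j is dropped on uppercasing.
        text = text.replace("i" + DOT_ABOVE, "i").replace("j" + DOT_ABOVE, "j")
    return text.upper()

def lower(text, model="latin"):
    """Lowercase text under an explicitly chosen casing model (hypothetical API)."""
    if model not in MODELS:
        raise ValueError("unknown casing model: %r" % (model,))
    if model == "lithuanian":
        raise NotImplementedError("Lithuanian lowercasing needs context rules")
    if model == "turkic":
        # Turkish/Azeri: U+0130 lowercases to plain i, plain I to dotless i (U+0131).
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()

print(ascii(upper("indigo")))             # 'INDIGO'
print(ascii(upper("indigo", "turkic")))   # '\u0130ND\u0130GO'
```

Because the model is an explicit argument with a fixed default, the same call gives the same result on every machine.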

Also, in case someone brings it up, it was formerly standard to
generally omit diacritics on capital letters in Portuguese and in
French (in France but not in Quebec!).

This is no longer the norm for either language. See
http://www.academie-francaise.fr/langue/questions.html#accentuation
and http://www.press.uchicago.edu/Misc/Chicago/cmosfaq/cmosfaq.SpecialCharacters.html.

I have seen academic style sheets with a silly rule that diacritics
should be placed on capital letters as on lowercase letters except for
the word "A". See http://www.alphaacademic.co.uk/fcs.htm and
http://www.sagepub.com/journalManuscript.aspx?pid=9669&sc=1:

<< We use accents on capital letters, but capital A does not take a
grave accent. >> 

It would not hurt to make a casing table customizable for such unusual
styles.  But that is beyond Unicode's specifications.

A programmer who wishes odd customization beyond the norms of a
language and Unicode specifications can do it through transformations
outside of normal casing.
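
For instance, the style-sheet rule quoted above could be layered on top of ordinary uppercasing roughly like this in Python 3. The function name upper_house_style is made up for illustration, and I am assuming the rule applies only to the stand-alone word "À", not to a final À inside a longer word.

```python
import re
import unicodedata

# Matches a capital A with grave accent (U+00C0) standing alone as a word.
_LONE_A_GRAVE = re.compile(r"\b" + "\u00c0" + r"\b")

def upper_house_style(text):
    """Uppercase normally, then apply the house-style exception for the word 'À'."""
    # Normalize first so that A + combining grave becomes the single code point U+00C0.
    shouted = unicodedata.normalize("NFC", text).upper()
    return _LONE_A_GRAVE.sub("A", shouted)

print(ascii(upper_house_style("voil\u00e0, \u00e0 la carte")))
# 'VOIL\xc0, A LA CARTE'
```

The normal casing function stays untouched; the oddity lives in a separate transformation applied afterwards.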

Jim Allan



