Incorrect title case?

John Machin sjmachin at lexicon.net
Sat Jan 17 19:36:58 EST 2009


On Jan 18, 10:15 am, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > A value of zero for ctype->title should be interpreted simply as the
> > offset to add to the ordinal, as it is in the sibling
> > _PyUnicode_To(Upper|Lower)case functions.
>
> Interestingly enough, according to the spec of UnicodeData.txt,
> these should *not* be siblings. Refer to
>
> http://www.unicode.org/Public/UNIDATA/UCD.html
>
> For lower and upper case, it says
>
> Note: The simple uppercase is omitted in the data file if the uppercase
> is the same as the code point itself.
>
> whereas for titlecase, it says
>
> Note: The simple titlecase may be omitted in the data file if the
> titlecase is the same as the uppercase.

However: (1) there seem to be no examples in the current data file
where the titlecase is empty and the uppercase is not empty, and
(2) the titlecase is *NOT* empty for the four characters in question
-- they have [in effect] ch.title() -> ch, as MRAB expected.
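
Point (1) can be checked mechanically. A minimal sketch, assuming the
standard semicolon-separated UnicodeData.txt layout (simple uppercase in
field 12, simple titlecase in field 14); the function name and the second
sample record are made up for illustration and are not taken from the
real data file:

```python
def empty_title_with_upper(lines):
    """Return code points whose simple-titlecase field (index 14) is empty
    while the simple-uppercase field (index 12) is not."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        upper, title = fields[12], fields[14]
        if upper and not title:
            hits.append(int(fields[0], 16))
    return hits


SAMPLE = [
    # Record in the real file's format: uppercase and titlecase both given.
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    # HYPOTHETICAL record (private-use code point): uppercase given,
    # titlecase omitted. This is the case that (1) says does not occur.
    "E000;HYPOTHETICAL LETTER;Ll;0;L;;;;;N;;;E001;;",
]

print([hex(cp) for cp in empty_title_with_upper(SAMPLE)])
```

Passing an open UnicodeData.txt file object instead of SAMPLE should
return an empty list if (1) holds for that version of the file.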

See my response in the bug tracker for further info/comment.

>
> So unicodectype is right to fall back to uppercase if no titlecase
> mapping is given.

Correct -- but this is currently hypothetical. Moreover, the "fallback"
is being done in the wrong place; it should be done in
Tools/Unicode/makeunicodedata.py when it reads the UnicodeData.txt file.
The current implementation codes the ch.title() -> ch mapping as
delta = 0, which is the same coding as used for "no titlecase specified
in file". That leaves the runtime unicodectype with a dilemma, which it
resolves wrongly -- it is *NOT* correct to pick uppercase when the
titlecase is actually specified in the UnicodeData.txt file.
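
To make the dilemma concrete, here is a rough Python model of the two
approaches. It is only a sketch: the function names and the code points
0x1000/0x1001 are hypothetical, and the real logic lives in C and in the
generator script, not in code like this.

```python
def parse(field, default):
    """Parse a hex field from a data record, or use default if it is empty."""
    return int(field, 16) if field else default


def ambiguous_deltas(code, upper_field, title_field):
    """Model of the coding described above: an empty titlecase field and
    title == code both end up stored as title delta 0."""
    return parse(upper_field, code) - code, parse(title_field, code) - code


def runtime_title_with_fallback(code, upper_delta, title_delta):
    """Runtime guesses that delta 0 means 'not specified' and picks uppercase."""
    if title_delta == 0:
        return code + upper_delta
    return code + title_delta


def resolved_deltas(code, upper_field, title_field):
    """Fallback applied at table-generation time: an empty titlecase field
    is replaced by the uppercase value before the delta is computed."""
    upper = parse(upper_field, code)
    title = parse(title_field, upper)
    return upper - code, title - code


def runtime_title_plain(code, title_delta):
    """Runtime just adds the offset, as for upper/lower case."""
    return code + title_delta


code = 0x1001  # hypothetical character: uppercase 0x1000, titlecase itself

# Titlecase explicitly given as "self" in the data record.
up, ti = ambiguous_deltas(code, "1000", "1001")
wrong = runtime_title_with_fallback(code, up, ti)   # picks uppercase
up, ti = resolved_deltas(code, "1000", "1001")
right = runtime_title_plain(code, ti)               # stays as itself

# Titlecase omitted in the data record: fallback to uppercase still works.
up, ti = resolved_deltas(code, "1000", "")
omitted = runtime_title_plain(code, ti)

print(hex(wrong), hex(right), hex(omitted))
```

With the fallback resolved while generating the tables, a stored delta
of 0 can only mean "maps to itself", so the runtime needs no special case.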

Note that although it's not mentioned in the modification history for
UnicodeData.txt, the titlecase entry for the 4 characters changed from
"empty" to "self" in Unicode 4.0.0.

HTH,
John


