[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

Ezio Melotti report at bugs.python.org
Mon Aug 15 12:20:26 CEST 2011


Ezio Melotti <ezio.melotti at gmail.com> added the comment:

So the issue here is that when the string uses combining chars, str.title() fails to titlecase it properly.

The algorithm implemented by str.title() [0] is quite simple: it loops through the code units, and uppercases all the chars that follow a char that is not lower/upper/titlecased.
This means that if Déme doesn't use combining accents, the char before the 'm' is 'é', 'é' is a lowercase char, so 'm' is not capitalized.
If the 'é' is represented as 'e' + '´', the char before the 'm' is '´', '´' is not a lower/upper/titlecase char, so the 'm' is capitalized.
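
To make the difference concrete, here is a rough pure-Python sketch of the loop described above. It is not the C code in unicodeobject.c; naive_title is a made-up name, and treating "lower/upper/titlecased" as islower()/isupper()/istitle() on a single char is my assumption.

```python
def naive_title(s):
    """Approximate the loop described above (a sketch, not the real C code)."""
    out = []
    prev_cased = False  # was the previous char lower/upper/titlecase?
    for ch in s:
        # Chars after a cased char are lowercased; all others are titlecased.
        out.append(ch.lower() if prev_cased else ch.title())
        prev_cased = ch.islower() or ch.isupper() or ch.istitle()
    return "".join(out)


composed = "d\u00e9me"     # 'é' as one precomposed code point
decomposed = "de\u0301me"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(ascii(naive_title(composed)))
print(ascii(naive_title(decomposed)))
```

With the precomposed 'é' the 'm' stays lowercase; with the combining accent, U+0301 is not cased, so the 'm' that follows it is treated as the start of a new word.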

I guess we could normalize the string before doing the title casing, and then normalize it back.
Also the str methods don't claim to follow Unicode afaik, so unless we decide that they should, we could implement whatever algorithm we want.
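
A minimal sketch of the normalize/title/normalize-back idea, using unicodedata from the stdlib. The helper name title_normalized and its form argument are hypothetical; the caller has to say which form to normalize back to, since the original form isn't recorded anywhere in the string.

```python
import unicodedata


def title_normalized(s, form="NFD"):
    """Compose to NFC, titlecase, then normalize back to `form`.

    Hypothetical helper for illustration only, not a proposed patch.
    """
    composed = unicodedata.normalize("NFC", s)
    return unicodedata.normalize(form, composed.title())


decomposed = "de\u0301me"  # 'e' + U+0301 COMBINING ACUTE ACCENT
print(ascii(title_normalized(decomposed)))
```

This only helps where a precomposed char exists. A base letter plus a combining mark with no precomposed form stays decomposed after NFC, so the following letter would still be capitalized.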

[0]: Objects/unicodeobject.c:6752

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12737>
_______________________________________

