.title() - annoying mistake

Mon Mar 22 07:17:31 EDT 2021

On Mon, Mar 22, 2021 at 9:21 PM Robert Latest via Python-list
<python-list at python.org> wrote:
>
> Chris Angelico wrote:
> > If you still, after all these posts, have not yet understood that
> > title-casing *a single character* is a significant thing,
>
> I must admit I have no idea what title-casing even is, but I'm eager to learn.

Now that's an attitude I like to see :)

> The documentation only talks about "words" and "first characters" and
> "remaining characters." So a single character gets converted to uppercase,
> whatever that may mean in the context of .title(). The .upper() method is
> different in that it only applies to "cased" characters, so .title() may or may
> not work differently on single characters.
>

A small number of characters become more than one character when
their case is converted. The German sharp S is one of them:

>>> "ß".upper(), "ß".lower(), "ß".casefold(), "ß".title()
('SS', 'ß', 'ss', 'Ss')
>>> "ẞ".upper(), "ẞ".lower(), "ẞ".casefold(), "ẞ".title()
('ẞ', 'ß', 'ss', 'ẞ')

Serbian has another special case, a digraph with its own distinct
titlecase form, although it is often written as two individual
characters:

>>> "Ǆ".upper(), "Ǆ".lower(), "Ǆ".casefold(), "Ǆ".title()
('Ǆ', 'ǆ', 'ǆ', 'ǅ')
>>> ["U+%04X" % ord(x) for x in _]
['U+01C4', 'U+01C6', 'U+01C6', 'U+01C5']

Even in text that's in the Latin script (the one we use with English),
there are some ligatures that behave differently when titlecased:

>>> "ﬁ".upper(), "ﬁ".lower(), "ﬁ".casefold(), "ﬁ".title()
('FI', 'ﬁ', 'fi', 'Fi')
>>> [" ".join("U+%04X" % ord(c) for c in x) for x in _]
['U+0046 U+0049', 'U+FB01', 'U+0066 U+0069', 'U+0046 U+0069']

Each of these inputs is a single character. Some of them produce
single-character outputs (U+01C5, for instance, is a character that
exists only as a titlecase form), while others produce several.
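If you want to see why U+01C5 is special, the unicodedata module will
tell you. This is only a small sketch to illustrate the point; the
variable names are mine, and it assumes a Python 3 interpreter with
its standard Unicode database:

```python
import unicodedata

# The three case forms of the same digraph: U+01C4, U+01C5, U+01C6
forms = ["\u01C4", "\u01C5", "\u01C6"]

# General category of each form:
# Lu = uppercase letter, Lt = titlecase letter, Ll = lowercase letter
categories = [unicodedata.category(c) for c in forms]

# Title-casing any of the three forms should give the Lt character
titled = [c.title() for c in forms]

print(categories)
print(["U+%04X" % ord(c) for c in titled])
```

The middle form gets its own category ("Lt"), which is why .title()
cannot simply be described as "uppercase the first character".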

The neat thing about Unicode is that you don't have to worry about
exactly which characters behave in which ways. You get methods that do
precisely what you need, as long as you choose the right method. For
case insensitive comparisons, there's casefold(), which is most
commonly the same as lower(), but not always; to find out if
something's a digit, use isdigit(); to split something into lines,
use splitlines(). They're all aware of the entire Unicode range, and
they'll reliably work even if future versions of Unicode introduce
more characters (although you might have to wait for Python to be
updated).
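Here's a quick sketch of those three methods on non-ASCII input. The
sample strings and variable names are just ones I picked for
illustration:

```python
# Case-insensitive comparison: lower() is not enough for the sharp S
a, b = "Straße", "STRASSE"
lower_equal = a.lower() == b.lower()        # compares 'straße' with 'strasse'
fold_equal = a.casefold() == b.casefold()   # compares 'strasse' with 'strasse'

# isdigit() knows about digits outside ASCII (U+0663 ARABIC-INDIC DIGIT THREE)
digit_check = "\u0663".isdigit()

# splitlines() treats U+2028 LINE SEPARATOR and "\r\n" as line boundaries
lines = "one\u2028two\r\nthree".splitlines()

print(lower_equal, fold_equal, digit_check, lines)
```

None of this needs a lookup table of special characters on your side;
picking the right method is the whole job.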

The documentation sometimes uses shorthand terms like "upper case"
and "lower case", partly because being pedantically correct in a
docstring doesn't actually help anyone, and the code itself IS
correct.

ChrisA