Ah Python, you have spoiled me for all other languages

Sun Jun 7 08:08:06 EDT 2015

On Sun, Jun 7, 2015 at 9:42 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> My opinion is that a programming language like Python or ECMAScript should
> operate on *code points*. If we want to call them "characters" informally,
> that should be allowed, but whenever there is ambiguity we should remember
> we're dealing with code points. The implementation shouldn't matter:
> compliant Python interpreters might choose to use UTF-8 internally, or
> UTF-16, or UTF-32, or something else, and still agree on how many
> characters a string contains. Normalisation is still an issue, of course,
> but any decent Unicode implementation will include a way to normalise or
> denormalise strings.

If by "normalise" you mean the NF[K]C/NF[K]D composition and
decomposition, then yes, any decent Unicode library will provide that.
I'm not sure it's critical to string handling itself, though; and
Python defers the operation to the unicodedata module:

>>> import unicodedata
>>> s1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}"
>>> s2 = "\N{LATIN SMALL LETTER A WITH ACUTE}"
>>> s1 == s2
False
>>> unicodedata.normalize("NFC", s1) == s2
True

It's a useful operation to have, but I would never expect *string
comparison* or other operations to normalize automatically. (Unless
you want to say that all strings are guaranteed to be NFC- or
NFD-normalized, such that s1 and s2 would actually be identical, which
I suppose is plausible. I'm not sure what the advantage would be,
though. And certainly you wouldn't want to K-normalize strings
automatically, since the compatibility forms throw away distinctions
such as ligatures and superscripts.)
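
To make the difference between the canonical and the K forms concrete,
here is a small sketch. The sample characters are my own picks for
illustration, not anything from the thread:

```python
import unicodedata

# Sample characters chosen for illustration only.
ligature = "\N{LATIN SMALL LIGATURE FI}"
squared = "x\N{SUPERSCRIPT TWO}"

# Canonical composition (NFC) leaves both strings untouched...
assert unicodedata.normalize("NFC", ligature) == ligature
assert unicodedata.normalize("NFC", squared) == squared

# ...while compatibility composition (NFKC) rewrites them to plain
# letters and digits, and there is no way to get the originals back.
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkc_squared = unicodedata.normalize("NFKC", squared)
print(nfkc_ligature, nfkc_squared)  # fi x2
```

That kind of folding is handy for searching, but it changes what the
text says, which is why it seems like a poor thing to do behind the
programmer's back.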

> The question of graphemes (what "ordinary people" consider letters and
> characters, e.g. "ch" is two letters to an English speaker but one letter
> to a Czech speaker) should be left to libraries. It's a much harder problem
> to solve in the full general case, requires localisation, and is overkill
> for many string-processing tasks.

Yeah. The basic challenge to a beginning programmer, "reverse this
string", becomes rather tricky in the presence of natural language.

>>> s1 += "e"
>>> s1
'áe'
>>> s1[::-1]
'éa'

Oops.
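
For what it's worth, a rough sketch of a reverse that keeps combining
marks attached to their base character might look like the following.
The function name is made up, and this is nowhere near full grapheme
segmentation: it only looks at unicodedata.combining(), so any mark
with a combining class of zero will still be split off.

```python
import unicodedata

def reverse_keeping_marks(text):
    """Reverse text, keeping each base character together with the
    combining marks that follow it (hypothetical helper, simplified)."""
    clusters = []
    for ch in text:
        # combining() returns a nonzero class for marks such as U+0301.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))

s1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}" + "e"
result = reverse_keeping_marks(s1)
# The accent stays on the "a": the result is "e" followed by "a" + U+0301.
assert result == "e" + "a\N{COMBINING ACUTE ACCENT}"
```

Anything beyond this simple case (Hangul jamo, emoji sequences,
locale-specific letters like the Czech "ch") needs a proper library.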

But hey. It's easier to understand what went wrong here than, say, if
you reverse the bytes in a UTF-8 stream. Or the code units in a UTF-16
stream. If you're lucky, those would give you instant errors... if
you're not, well, who knows.
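
A quick sketch of the "lucky" cases, where the damage shows up as an
exception. The sample strings are my own choices:

```python
s = "a\N{COMBINING ACUTE ACCENT}e"

# Reversing raw UTF-8 bytes puts a continuation byte ahead of its
# lead byte, which the decoder rejects.
reversed_utf8 = s.encode("utf-8")[::-1]
try:
    reversed_utf8.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Reversing UTF-16 code units (two bytes each) turns a surrogate pair
# around, so the low surrogate arrives first.
astral = "\U0001F600"  # one code point outside the BMP
data = astral.encode("utf-16-le")
units = [data[i:i + 2] for i in range(0, len(data), 2)]
reversed_utf16 = b"".join(reversed(units))
try:
    reversed_utf16.decode("utf-16-le")
    utf16_failed = False
except UnicodeDecodeError:
    utf16_failed = True

print(utf8_failed, utf16_failed)  # True True
```

The unlucky case is UTF-16 text that stays inside the BMP: reversing
its code units decodes without complaint and just gives you the same
misplaced accent as the code point reversal above.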

ChrisA