Ah Python, you have spoiled me for all other languages

Sun Jun 7 09:24:05 EDT 2015

On Sun, 7 Jun 2015 10:08 pm, Chris Angelico wrote:

> On Sun, Jun 7, 2015 at 9:42 PM, Steven D'Aprano <steve at pearwood.info>
> wrote:
>> My opinion is that a programming language like Python or ECMAScript
>> should operate on *code points*. If we want to call them "characters"
>> informally, that should be allowed, but whenever there is ambiguity we
>> should remember we're dealing with code points. The implementation
>> shouldn't matter: compliant Python interpreters might choose to use UTF-8
>> internally, or UTF-16, or UTF-32, or something else, and still agree on
>> how many characters a string contains. Normalisation is still an issue,
>> of course, but any decent Unicode implementation will include a way to
>> normalise or denormalise strings.
> 
> If by "normalise" you mean the NF[K]C/NF[K]D composition and
> decomposition, then yes, any decent Unicode library will provide that.

Dat's der bunny!

> I'm not sure it's critical to string handling itself, though; and
> Python defers the operation to the unicodedata module:
> 
>>>> import unicodedata
>>>> s1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}"
>>>> s2 = "\N{LATIN SMALL LETTER A WITH ACUTE}"
>>>> s1 == s2
> False
>>>> unicodedata.normalize("NFC", s1) == s2
> True
> 
> It's a useful operation to be able to do, but I would never expect
> that *string comparison* or other operations should automatically
> normalize.

I completely agree.

It might be convenient to have a string equality method that did
normalisation, but for most cases it would be unnecessary and slow. I think
that's the sort of thing which should be left to a subclass of str, and it
should normalise on construction.
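
As a rough sketch of what such a subclass might look like (the name
NormalisedStr is made up for illustration, and NFC is just one possible
choice of form):

```python
import unicodedata


class NormalisedStr(str):
    """Hypothetical str subclass that normalises to NFC on construction."""

    def __new__(cls, value=""):
        # Pay the normalisation cost once, up front, so that ordinary
        # str equality and hashing can be used afterwards.
        return super().__new__(cls, unicodedata.normalize("NFC", str(value)))


s1 = NormalisedStr("\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}")
s2 = NormalisedStr("\N{LATIN SMALL LETTER A WITH ACUTE}")
print(s1 == s2)  # True
print(len(s1))   # 1
```

Note that inherited methods and operators such as + or .upper() hand back
a plain str, and concatenating two NFC strings does not always give an NFC
result, so a real implementation would need to re-wrap those results.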

> (Unless you want to say that all strings are guaranteed to 
> be NFC/NFD normalized, such that s1 and s2 would actually be
> identical, which I suppose is plausible. I'm not sure what the
> advantage would be, though. And certainly you wouldn't want to
> K-normalize strings automatically.)
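
To illustrate why automatic K-normalisation would be unwelcome: the
compatibility forms throw away distinctions that NFC keeps. A small sketch,
using sample characters chosen purely for illustration:

```python
import unicodedata

# A superscript two and an "fi" ligature, both compatibility characters.
s = "x\N{SUPERSCRIPT TWO} \N{LATIN SMALL LIGATURE FI}"

nfc = unicodedata.normalize("NFC", s)
nfkc = unicodedata.normalize("NFKC", s)

print(nfc == s)  # True: NFC leaves both characters alone
print(nfkc)      # x2 fi: the superscript and the ligature are gone
```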

I believe that filenames on Apple file systems (HFS+ if I remember
correctly) are guaranteed to be normalised (to a decomposed form close to
NFD) and to be valid Unicode, which the POSIX-level APIs hand back as
UTF-8. If you could live in a purely Apple world, you'd have far fewer
filename hassles.

-- 
Steven