Instagram: 40% Py3 to 99% Py3 in 10 months (Posting On Python-List Prohibited)

Thu Jun 22 10:50:36 EDT 2017

On Thu, 22 Jun 2017 11:57 pm, Chris Angelico wrote:

> On Thu, Jun 22, 2017 at 11:33 PM, Steve D'Aprano
> <steve+python at pearwood.info> wrote:
>> and besides some Unicode code points are not
>> characters at all).
>>
>> http://www.unicode.org/faq/private_use.html#noncharacters
> 
> AIUI, "noncharacters" are like the IEEE floating point value
> "not-a-number". 

That's... kinda fair.

Although, the Unicode Consortium thinks of them as more like private use
characters, only even more private, and not characters :-)

(If you ask me, I think the noncharacters exist because "it seemed like a good
idea at the time" -- the use-case for them seems particularly ill-defined. I
suspect that if we were to redo Unicode from scratch, they wouldn't be
included.)

> So a character count should normally *include* any noncharacters in the
> string.

That depends on what you mean by *character*.

If you mean "code point", then I agree it should be counted.

If you mean "a letter, a digit, an ideograph, emoji, ... " then probably not.
(Depends what's in the ellipsis :-)

If you mean a grapheme, then certainly not, because the 66 Unicode noncharacters
don't belong to any human language.

If you mean "a grapheme cluster, or a code point for things which aren't
characters from human languages" then I guess they should be counted, as should
control characters, formatting marks, surrogate code points, and anything else
which doesn't represent a natural language character.
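
To make the difference between those definitions concrete, here's a rough
sketch in Python 3. The helper names (is_noncharacter and friends) are made up
for illustration and aren't part of any library. It relies only on the
noncharacters being U+FDD0..U+FDEF plus the last two code points of each of
the 17 planes.

```python
def is_noncharacter(ch):
    """Return True if ch is one of the 66 Unicode noncharacters.

    Hypothetical helper for illustration, not a stdlib function.
    """
    cp = ord(ch)
    # U+FDD0..U+FDEF, plus U+xFFFE and U+xFFFF in each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def count_code_points(s):
    # In Python 3, len() of a str counts code points.
    return len(s)


def count_excluding_noncharacters(s):
    # Count code points, skipping the noncharacters.
    return sum(1 for ch in s if not is_noncharacter(ch))


sample = "a\ufdd0b\uffff"
print(count_code_points(sample))              # 4
print(count_excluding_noncharacters(sample))  # 2

# Sanity check: scanning the whole code space finds 66 of them.
total = sum(is_noncharacter(chr(cp)) for cp in range(0x110000))
print(total)  # 66
```

Which of the two counts is "right" depends entirely on which definition of
"character" you picked.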

What is a natural language character? Is IJ one or two characters? Depends on
whether you're Dutch or not ;-)
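
Unicode itself has it both ways. A small sketch using only the stdlib
unicodedata module (the variable names are just for illustration): there is a
single code point U+0132 for the ligature, and it is compatibility-equivalent
to the two-letter sequence "IJ".

```python
import unicodedata

ligature = "\u0132"  # a single code point for the Dutch IJ
digraph = "IJ"       # two ordinary Latin letters

print(len(ligature), len(digraph))  # 1 2
print(unicodedata.name(ligature))   # LATIN CAPITAL LIGATURE IJ

# Canonical normalization (NFC) leaves the ligature alone...
print(unicodedata.normalize("NFC", ligature) == ligature)  # True
# ...while compatibility normalization (NFKC) splits it into two letters.
print(unicodedata.normalize("NFKC", ligature) == digraph)  # True
```

So even len() gives one or two depending on how the text happened to be
encoded, never mind what a Dutch reader would say.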

This is why the Unicode standard tries not to talk in terms of "characters".
The term is not well-defined.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.