Grapheme clusters, a.k.a. real characters

MRAB python at mrabarnett.plus.com
Sat Jul 15 21:54:06 EDT 2017


On 2017-07-16 02:20, Rick Johnson wrote:
> On Saturday, July 15, 2017 at 7:29:14 PM UTC-5, Chris Angelico wrote:
>> [...] Also, that doesn't deal with
>> U+200B or U+180E, which have well-defined widths *smaller* than
>> typical Latin letters. (200B is a zero-width space. Is it a
>> character?)
> 
> Of *COURSE* it's a character.
> 
> Would you also consider 0 not to be a number?
> 
> Sheesh!
> 
[snip]

You need to be careful about the terminology.

Is linefeed a character? You might call it a "control character", but 
it's not really a _character_, it's a control/format _code_.

Is an acute accent a character? No, it's a diacritic mark that's added 
to a character.

When you're working with Unicode strings, you're not working with 
strings of characters as such, but with strings of 'codepoints', some of 
which are characters, others combining marks, yet others format codes, 
and so on.
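
As a rough illustration of that distinction, here is a minimal sketch 
using the standard library's unicodedata module. The helper name 
describe() and the sample string are made up for this example; the 
general categories are the ones the Unicode database assigns.

```python
import unicodedata

def describe(text):
    """Return (codepoint, general category, name) for each codepoint in text."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            # Control codes such as linefeed have no name in the database,
            # so supply a placeholder instead of raising ValueError.
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]

# Hypothetical sample: a letter, a combining acute accent, a linefeed,
# and a zero-width space.
sample = "e\u0301\n\u200b"

for row in describe(sample):
    print(row)
```

The four codepoints come back as Ll (lowercase letter), Mn (nonspacing 
combining mark), Cc (control code) and Cf (format code), and len(sample) 
is 4 even though a reader would see only one "real character" there.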


