Could you verify this, Oh Great Unicode Experts of the Python-List?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Aug 11 07:14:52 EDT 2013


On Sun, 11 Aug 2013 10:44:40 +0100, Joshua Landau wrote:

> On 11 August 2013 10:09, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> The reason some accented letters have single code point forms is to
>> support legacy charsets; the reason some only exist as combining
>> characters is due to the combinatorial explosion. Some languages allow
>> you to add up to five or six different accents on any of dozens of
>> different letters. If each combination needed its own unique code
>> point, there wouldn't be enough code points. For bonus points, if there
>> are five accents that can be placed in any combination of zero or more
>> on any of four characters, how many code points would be needed?
> 
> 52?

More than double that.

Consider a single character. It can have 0 to 5 accents, in any 
combination. Order doesn't matter, and there are no duplicates, so there 
are:

0 accent: take 0 from 5 = 1 combination;
1 accent: take 1 from 5 = 5 combinations;
2 accents: take 2 from 5 = 5!/(2!*3!) = 10 combinations;
3 accents: take 3 from 5 = 5!/(3!*2!) = 10 combinations;
4 accents: take 4 from 5 = 5 combinations;
5 accents: take 5 from 5 = 1 combination

giving a total of 32 combinations for a single character. Since there are 
four characters in this hypothetical language that take accents, that 
gives a total of 4*32 = 128 distinct code points needed.
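The counting argument above is easy to check in Python with math.comb 
(available since Python 3.8); earlier versions could compute the same 
thing by hand:

```python
from math import comb

# Each letter can carry any subset of the 5 accents (order doesn't
# matter, no duplicates), so sum C(5, k) for k = 0..5, which is 2**5.
per_letter = sum(comb(5, k) for k in range(6))
total = 4 * per_letter  # four base letters in the hypothetical language
print(per_letter, total)  # 32 128
```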

In reality, Unicode currently assigns the code points U+0300 to U+036F 
(112 code points) to combining characters. It's not really meaningful to 
combine all 112 of them, or even most of them, but let's assume that we 
can legitimately combine up to three of them on average (some languages 
will allow more, some less) on just six different letters. That gives us:

0 accent: 1 combination
1 accent: 112 combinations
2 accents: 112!/(2!*110!) = 6216 combinations
3 accents: 112!/(3!*109!) = 227920 combinations

giving 234249 combinations, times six base characters, = 1405494 code 
points. Which is comfortably more than the 1114112 code points Unicode 
has in total :-)

This calculation is horribly inaccurate, since you can't arbitrarily 
combine (say) accents from Greek with accents from IPA, but I reckon that 
the combinatorial explosion of accented letters is still real.
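The same back-of-envelope estimate in Python, comparing against Unicode's 
total code space of 0x110000 code points:

```python
from math import comb

# Up to 3 of the 112 combining marks (U+0300..U+036F) on each of
# six base letters, as in the estimate above.
per_letter = sum(comb(112, k) for k in range(4))  # k = 0, 1, 2, 3 accents
total = 6 * per_letter
print(per_letter)           # 234249
print(total)                # 1405494
print(total > 0x110000)     # True -- more than the 1114112 available
```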


[...]
>> Of course, they might be lying when they say "Twitter counts the length
>> of a Tweet using the Normalization Form C (NFC) version of the text", I
>> have no idea. But they seem to have a good grasp of the issues
>> involved,
>> and assuming they do what they say, at least Western European users
>> should be happy.
> 
> They *don't* seem to be doing what they say.
[...]
>>> "café" will be in your Copy-Paste buffer, and you can paste it in to
>>> the tweet-box. It takes 5 characters. So much for testing ;).
>>
>> How do you know that it takes 5 characters? Is that some Javascript
>> widget? I'd blame buggy Javascript before Twitter.
> 
> I go to twitter.com, log in and press that odd blue compose button in
> the top-right. After pasting it says I have 135 (down from 140)
> characters left.

I'm pretty sure that will be a piece of Javascript running in your 
browser that reports the number of characters in the text box. So, I 
would expect that either:

- Javascript doesn't provide a way to normalize text;

- Twitter's Javascript developer(s) don't know how to normalize text, or 
can't be bothered to follow company policy (shame on them);

- the Javascript just asks the browser, and the browser doesn't know how 
to count characters the Twitter way;

etc. But of course posting to Twitter via your browser isn't the only way 
to post. Twitter provide an API to tweet, and *that* is the ultimate test 
of whether Twitter's dev guide is lying or not.


> My only question here is, since you can't post after 140 non-normalised
> characters, who cares if the server counts it as less?

People who bypass the browser and write their own Twitter client.
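A client author can do the NFC counting themselves before posting, using 
the standard library's unicodedata module (this is just the counting 
step, not the actual API call):

```python
import unicodedata

# "cafe" followed by U+0301 COMBINING ACUTE ACCENT: 5 code points,
# but the NFC form composes e + accent into a single code point.
decomposed = "cafe\u0301"
print(len(decomposed))  # 5 -- what a naive len() reports
nfc = unicodedata.normalize("NFC", decomposed)
print(len(nfc))         # 4 -- the length Twitter says it counts
```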


>> If this shows up in your application as café rather than café, it is a
>> bug in the text rendering engine. Some applications do not deal with
>> combining characters correctly.
> 
> Why the rendering engine?

If the text renderer assumes it can draw one code point at a time, it 
will draw the "e", then reach the combining accent. It could, in 
principle, backspace and draw it over the "e", but more likely it will 
just draw it next to it.

What the renderer should do is walk the string, collecting characters 
until it reaches one which is not a combining character, then draw them 
all at once, one on top of the other. A good font may have special 
glyphs, or at least hints, for combining accents. For instance, if you 
have a dot accent and a comma accent drawn one on top of the other, it 
looks like a comma; what you are supposed to do is move them side by 
side, so you have separate dot and comma glyphs.
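That walk-and-collect step can be sketched in Python with 
unicodedata.combining(), which returns a non-zero combining class for 
combining characters. (Real grapheme-cluster segmentation, per UAX #29, 
is more involved than this; the sketch only handles the simple case 
discussed above.)

```python
import unicodedata

def clusters(s):
    """Group each base character with the combining characters that
    follow it, roughly the way a renderer would before drawing."""
    out = []
    for ch in s:
        if unicodedata.combining(ch) and out:
            out[-1] += ch   # attach accent to the preceding base char
        else:
            out.append(ch)  # start a new cluster
    return out

print(clusters("cafe\u0301"))  # ['c', 'a', 'f', 'e\u0301']
```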


>> (It's a hard problem to solve, and really needs support from the font.
>> In some languages, the same accent will appear in different places
>> depending on the character they are attached to, or the other accents
>> there as well. Or so I've been led to believe.)
>>
>>
>>> ¹ https://dev.twitter.com/docs/counting-
>>> characters#Definition_of_a_Character
>>
>> Looks reasonable to me. No obvious errors to my eyes.
> 
> *Not sure whether talking about the link or my post*

The dev.twitter.com post.



-- 
Steven
