Could you verify this, Oh Great Unicode Experts of the Python-List?

Sun Aug 11 05:44:40 EDT 2013

On 11 August 2013 10:09, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> The reason some accented letters have single code point forms is to
> support legacy charsets; the reason some only exist as combining
> characters is due to the combinational explosion. Some languages allow
> you to add up to five or six different accents on any of dozens of
> different letters. If each combination needed its own unique code point,
> there wouldn't be enough code points. For bonus points, if there are five
> accents that can be placed in any combination of zero or more on any of
> four characters, how many code points would be needed?

52?
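
Then again, counting it out by brute force gives something else. This is only a sketch under my reading of the question: each accent is used at most once per letter, the order of accents doesn't matter, and the bare letter (zero accents) counts too. The accent and letter names are placeholders I made up.

```python
from itertools import combinations

# Placeholder names; the question only says "five accents" and "four characters".
accents = ["acute", "grave", "circumflex", "diaeresis", "tilde"]
letters = ["a", "e", "o", "u"]

# Every unordered subset of the accents, including the empty one.
subsets = [
    combo
    for size in range(len(accents) + 1)
    for combo in combinations(accents, size)
]

per_letter = len(subsets)            # same as 2 ** len(accents)
total = per_letter * len(letters)

print(per_letter, total)
```

If the order of stacked accents mattered, the number would be larger still.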

> Note that the form you used, b"caf\x65\xCC\x81", is the same as the first
> except that you have shown "e" in hex for some reason:
>
> py> b'\x65' == b'e'
> True

Yeah.. I did that because the linked post did it. I'm not sure why either ;).

> On Sun, 11 Aug 2013 07:17:42 +0100, Joshua Landau wrote:
>>
>> So the solution is:
>>
>>     >>> import unicodedata
>>     >>> len(unicodedata.normalize("NFC", tweet))
>>     4
>
> In this particular case, this will reduce the tweet to the normalised
> form that Twitter uses.
>
> [...]
>> After further testing (I don't actually use Twitter) it seems the whole
>> thing was just smoke and mirrors. The linked article is a lie, at least
>> on the user's end.
>
> Which linked article? The one on dev.twitter.com seems to be okay to me.

That's the one.

> Of course, they might be lying when they say "Twitter counts the length
> of a Tweet using the Normalization Form C (NFC) version of the text", I
> have no idea. But they seem to have a good grasp of the issues involved,
> and assuming they do what they say, at least Western European users
> should be happy.

They *don't* seem to be doing what they say.

>> On Linux you can prove this by running:
>>
>>     >>> p = subprocess.Popen(['xsel', '-bi'], stdin=subprocess.PIPE)
>>     >>> p.communicate(input=b"caf\x65\xCC\x81")
>>     (None, None)
>>
>> "café" will be in your Copy-Paste buffer, and you can paste it in to
>> the tweet-box. It takes 5 characters. So much for testing ;).
>
> How do you know that it takes 5 characters? Is that some Javascript
> widget? I'd blame buggy Javascript before Twitter.

I go to twitter.com, log in and press that odd blue compose button in
the top-right. After pasting it says I have 135 (down from 140)
characters left.
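
That 135 is what you'd expect if the box counts code points without normalising first. A minimal sketch of the two counts, assuming the limit is simply 140 minus the length (I don't know what the compose box or the server actually does internally):

```python
import unicodedata

# The same bytes as in the xsel example: "cafe" followed by U+0301.
tweet = b"caf\x65\xCC\x81".decode("utf-8")

LIMIT = 140  # assumed character limit

raw_len = len(tweet)                                   # code points as pasted
nfc_len = len(unicodedata.normalize("NFC", tweet))     # code points after NFC

remaining_raw = LIMIT - raw_len
remaining_nfc = LIMIT - nfc_len

print(raw_len, nfc_len, remaining_raw, remaining_nfc)
```

The un-normalised count matches what the compose box shows me; the NFC count is what the documentation describes.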

My only question here is, since you can't post after 140
non-normalised characters, who cares if the server counts it as less?

> If this shows up in your application as café rather than café, it is a
> bug in the text rendering engine. Some applications do not deal with
> combining characters correctly.

Why the rendering engine?
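
For context, as far as I understand it the two spellings are different code point sequences that Unicode treats as canonically equivalent, so whatever draws the text receives different data in each case. A small sketch of what is actually in each string (the variable names are mine):

```python
import unicodedata

decomposed = "cafe\u0301"   # "e" followed by a combining accent
precomposed = "caf\u00e9"   # single precomposed code point

# List the code point names that a renderer would be handed.
for label, text in (("decomposed", decomposed), ("precomposed", precomposed)):
    print(label, [unicodedata.name(ch) for ch in text])

# Different code points, but equal once both are normalised to NFC.
same_code_points = decomposed == precomposed
equivalent = unicodedata.normalize("NFC", decomposed) == precomposed

# A non-zero combining class marks U+0301 as a combining character.
is_combining = unicodedata.combining("\u0301") != 0

print(same_code_points, equivalent, is_combining)
```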

> (It's a hard problem to solve, and really needs support from the font. In
> some languages, the same accent will appear in different places depending
> on the character they are attached to, or the other accents there as
> well. Or so I've been led to believe.)
>
>
>> ¹ https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
>
> Looks reasonable to me. No obvious errors to my eyes.

*Not sure whether talking about the link or my post*