Could you verify this, Oh Great Unicode Experts of the Python-List?
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Aug 11 05:09:44 EDT 2013
On Sun, 11 Aug 2013 07:17:42 +0100, Joshua Landau wrote:
> Basically, I think Twitter's broken.
Oh, in about a million ways, but apparently people like it :-(
> For my full discussion on the matter, see:
> http://www.reddit.com/r/learnpython/comments/1k2yrn/help_with_len_and_input_function_33/cbku5e8
>
> Here's the first post of mine, ineffectually edited for this list:
>
> """
> <strikethrough>The obvious solution [to getting the length of a tweet]
> is wrong. Like, slightly wrong¹.</strikethrough>
>
> Given tweet = b"caf\x65\xCC\x81".decode():
I assume you're using Python 3, where UTF-8 is the default encoding.
> >>> tweet
> 'café'
>
> But:
>
> >>> len(tweet)
> 5
Yes, that's correct. Unicode doesn't promise to have a single unique
representation for all human-readable strings. In this case, the string
"cafe" with an accent on the "e" can be generated by two sequences of
code points:
LATIN SMALL LETTER C
LATIN SMALL LETTER A
LATIN SMALL LETTER F
LATIN SMALL LETTER E
COMBINING ACUTE ACCENT
or
LATIN SMALL LETTER C
LATIN SMALL LETTER A
LATIN SMALL LETTER F
LATIN SMALL LETTER E WITH ACUTE
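(A quick way to see which of the two spellings you actually have is to ask
the unicodedata module for the name of each code point. A minimal sketch;
the variable names here are mine, not from the quoted post:)

```python
import unicodedata

# Two spellings of the same visible word "café":
decomposed = "cafe\u0301"  # ends with COMBINING ACUTE ACCENT
composed = "caf\u00e9"     # ends with LATIN SMALL LETTER E WITH ACUTE

for ch in decomposed:
    print(unicodedata.name(ch))
# ...the last name printed is COMBINING ACUTE ACCENT

print(unicodedata.name(composed[-1]))
# LATIN SMALL LETTER E WITH ACUTE
```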
The reason some accented letters have single code point forms is to
support legacy charsets; the reason some only exist as combining
characters is the combinatorial explosion. Some languages allow
you to add up to five or six different accents to any of dozens of
different letters. If each combination needed its own unique code point,
there wouldn't be enough code points. For bonus points: if there are five
accents that can be placed in any combination of zero or more on any of
four characters, how many code points would be needed?
Neither form is "right" or "wrong"; both are equally valid. They
encode differently, of course, since UTF-8 does guarantee that every
sequence of code points has a unique byte representation:
py> tweet.encode('utf-8')
b'cafe\xcc\x81'
py> u'café'.encode('utf-8')
b'caf\xc3\xa9'
Note that the form you used, b"caf\x65\xCC\x81", is the same as the first
except that you have shown "e" in hex for some reason:
py> b'\x65' == b'e'
True
> So the solution is:
>
> >>> import unicodedata
> >>> len(unicodedata.normalize("NFC", tweet))
> 4
In this particular case, this will reduce the tweet to the normalised
form that Twitter uses.
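(A sketch of what that normalisation does to the example tweet; the
variable names are mine, the byte string is the one from the quoted post:)

```python
import unicodedata

tweet = b"caf\x65\xCC\x81".decode()        # "café": e + combining accent
nfc = unicodedata.normalize("NFC", tweet)  # composed form Twitter counts
nfd = unicodedata.normalize("NFD", tweet)  # fully decomposed form

print(len(tweet), len(nfc), len(nfd))      # 5 4 5
print(nfc == tweet)                        # False: different code points
print(nfc.encode('utf-8'))                 # b'caf\xc3\xa9'
```

Note that NFC and NFD change which code points are used, not which
characters the reader sees: all three strings display identically.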
[...]
> After further testing (I don't actually use Twitter) it seems the whole
> thing was just smoke and mirrors. The linked article is a lie, at least
> on the user's end.
Which linked article? The one on dev.twitter.com seems to be okay to me.
Of course, they might be lying when they say "Twitter counts the length
of a Tweet using the Normalization Form C (NFC) version of the text", I
have no idea. But they seem to have a good grasp of the issues involved,
and assuming they do what they say, at least Western European users
should be happy.
> On Linux you can prove this by running:
>
> >>> p = subprocess.Popen(['xsel', '-bi'], stdin=subprocess.PIPE)
> >>> p.communicate(input=b"caf\x65\xCC\x81")
> (None, None)
>
> "café" will be in your Copy-Paste buffer, and you can paste it in to
> the tweet-box. It takes 5 characters. So much for testing ;).
How do you know that it takes 5 characters? Is that some Javascript
widget? I'd blame buggy Javascript before Twitter.
If this shows up in your application with the accent drawn as a separate
mark rather than as "café", it is a
bug in the text rendering engine. Some applications do not deal with
combining characters correctly.
(It's a hard problem to solve, and really needs support from the font. In
some languages, the same accent will appear in different places depending
on the character they are attached to, or the other accents there as
well. Or so I've been led to believe.)
> ¹ https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
Looks reasonable to me. No obvious errors to my eyes.
--
Steven