Could you verify this, Oh Great Unicode Experts of the Python-List?

Joshua Landau joshua at landau.ws
Sun Aug 11 05:54:43 EDT 2013


On 11 August 2013 07:24, Chris Angelico <rosuav at gmail.com> wrote:
> On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau <joshua at landau.ws> wrote:
>> Given tweet = b"caf\x65\xCC\x81".decode():
>>
>>     >>> tweet
>>     'café'
>>
>> But:
>>
>>     >>> len(tweet)
>>     5
>
> You're now looking at the difference between glyphs and combining
> characters. Twitter counts combining characters, so when you build one
> "thing" out of lots of separately-typed parts, it does count as more
> characters.
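
To make that distinction concrete, here is a minimal sketch (Python 3, standard library only) that lists the code points in the string from my example. The variable names are just illustrative.

```python
import unicodedata

tweet = b"caf\x65\xCC\x81".decode()  # UTF-8 is the default codec in Python 3

# One entry per code point: (hex value, Unicode name, canonical combining class)
code_points = [
    (hex(ord(ch)), unicodedata.name(ch), unicodedata.combining(ch))
    for ch in tweet
]
for entry in code_points:
    print(entry)

# A nonzero combining class means the code point attaches to the preceding
# base character instead of standing on its own.
combining_count = sum(1 for ch in tweet if unicodedata.combining(ch))
print(len(tweet), combining_count)
```

The last entry should be the combining accent, which is the extra code point that len() is counting.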

@https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
> The "café" issue mentioned above raises the question of how you count
> the characters in the Tweet string "café". To the human eye the length is
> clearly four characters. Depending on how the data is represented this
> could be either five or six UTF-8 bytes. Twitter does not want to penalize
> a user for the fact we use UTF-8 or for the fact that the API client in
> question used the longer representation. Therefore, Twitter does count
> "café" as four characters no matter which representation is sent.

Which would imply that Twitter doesn't count combining characters,
even though the web interface seems to.
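
A rough sketch of what that page seems to describe, assuming the count is taken over the NFC-normalized text. The helper name tweet_length is hypothetical, and this is not Twitter's actual code, only my reading of the documentation.

```python
import unicodedata

def tweet_length(text):
    """Hypothetical helper: count code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))

decomposed = b"caf\x65\xCC\x81".decode()  # 'e' followed by COMBINING ACUTE ACCENT
precomposed = b"caf\xC3\xA9".decode()     # single code point U+00E9

# Raw code point counts differ between the two representations...
print(len(decomposed), len(precomposed))
# ...but the counts after NFC normalization agree.
print(tweet_length(decomposed), tweet_length(precomposed))
```

Note that NFC only helps where a precomposed code point exists. A base character plus a combining mark with no precomposed form would still count as more than one under this scheme.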

> Read this article for some arguments on the subject, including a
> number of references to Twitter itself:
>
> http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

I read that *last* time you pointed it out :P. It's a good link, though.

--
Anyhow, it's good to know I haven't been obviously stupid with my
understanding of Unicode. I learnt it all from this list anyway;
wouldn't want to disappoint!


