Could you verify this, Oh Great Unicode Experts of the Python-List?

MRAB python at mrabarnett.plus.com
Sun Aug 11 11:34:45 EDT 2013


On 11/08/2013 10:54, Joshua Landau wrote:
> On 11 August 2013 07:24, Chris Angelico <rosuav at gmail.com> wrote:
>> On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau <joshua at landau.ws> wrote:
>>> Given tweet = b"caf\x65\xCC\x81".decode():
>>>
>>>     >>> tweet
>>>     'café'
>>>
>>> But:
>>>
>>>     >>> len(tweet)
>>>     5
>>
>> You're now looking at the difference between glyphs and combining
>> characters. Twitter counts combining characters, so when you build one
>> "thing" out of lots of separately-typed parts, it does count as more
>> characters.
>
> @https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
>> The "café" issue mentioned above raises the question of how you count
>> the characters in the Tweet string "café". To the human eye the length is
>> clearly four characters. Depending on how the data is represented this
>> could be either five or six UTF-8 bytes. Twitter does not want to penalize
>> a user for the fact we use UTF-8 or for the fact that the API client in
>> question used the longer representation. Therefore, Twitter does count
>> "café" as four characters no matter which representation is sent.
>
> Which would imply that Twitter doesn't count combining characters,
> even though the web interface seems to.
>
>> Read this article for some arguments on the subject, including a
>> number of references to Twitter itself:
>>
>> http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
>
> I read that *last* time you pointed it out :P. It's a good link, though.
>
> --
> Anyhow, it's good to know I haven't been obviously stupid with my
> understanding of Unicode. I learnt it all from this list anyway;
> wouldn't want to disappoint!
>
If Twitter counts characters, not codepoints, you could then ask
whether it passes the codepoints through as given. If it does, then you
could experiment to see how much data you can send encoded as a sequence
of combining codepoints. (You might want to check the Terms of Use first,
though! :-))
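
For anyone who wants to poke at the counting locally, here is a rough
sketch using only the standard unicodedata module. It assumes that
"counting characters" means something like counting codepoints after
NFC normalization, which is a guess based on the quoted documentation
and not a statement of what Twitter actually does. The helper names
(count_nfc, count_base_characters) are made up for illustration.

```python
import unicodedata


def count_nfc(text):
    """Number of codepoints after NFC normalization (assumed counting rule)."""
    return len(unicodedata.normalize("NFC", text))


def count_base_characters(text):
    """Number of codepoints that are not combining marks.

    This is only a crude stand-in for "what the human eye sees"; it is
    not full grapheme-cluster segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# The string from the original question: "cafe" + U+0301 COMBINING ACUTE ACCENT.
tweet = b"caf\x65\xCC\x81".decode("utf-8")
composed = unicodedata.normalize("NFC", tweet)

print(len(tweet))                      # 5 codepoints as sent
print(len(composed))                   # 4 codepoints once composed
print(len(tweet.encode("utf-8")))      # 6 bytes in the longer representation
print(len(composed.encode("utf-8")))   # 5 bytes in the shorter one

# The experiment suggested above: pile combining marks onto one base letter.
stacked = "e" + "\u0301" * 10
print(len(stacked))                    # 11 codepoints
print(count_nfc(stacked))              # 10: only the first accent composes
print(count_base_characters(stacked))  # 1 non-combining codepoint
```

Note that NFC only folds the first accent into the base letter, so a
counter based on NFC alone would still charge for the other nine. Only
a counter that ignores combining marks entirely would leave room for
smuggling data that way.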


