Could you verify this, Oh Great Unicode Experts of the Python-List?

Chris Angelico rosuav at gmail.com
Sun Aug 11 02:24:23 EDT 2013


On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau <joshua at landau.ws> wrote:
> Given tweet = b"caf\x65\xCC\x81".decode():
>
>     >>> tweet
>     'café'
>
> But:
>
>     >>> len(tweet)
>     5

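You're now looking at the difference between glyphs and code points.
The "é" here is two code points, "e" followed by the combining
character U+0301 (COMBINING ACUTE ACCENT), and len() counts both.
Twitter counts combining characters, so when you build one "thing" out
of lots of separately-typed parts, it does count as more characters.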
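
A minimal sketch of the counting difference, using only the standard
unicodedata module. The variable names are illustrative, and it makes
no claim about how Twitter itself counts; it only shows what len()
sees before and after NFC normalization:

```python
import unicodedata

# Same string as in the quoted message: "e" followed by U+0301
# (COMBINING ACUTE ACCENT), encoded as UTF-8.
tweet = b"caf\x65\xCC\x81".decode("utf-8")

# len() counts code points, so the combining accent is counted separately.
decomposed_len = len(tweet)

# NFC normalization composes "e" + U+0301 into the single code point U+00E9.
composed = unicodedata.normalize("NFC", tweet)
composed_len = len(composed)

# Rough approximation of "visible characters": skip code points whose
# canonical combining class is non-zero.
base_count = sum(1 for ch in tweet if unicodedata.combining(ch) == 0)

print(decomposed_len, composed_len, base_count)  # prints: 5 4 4
```

Both strings render as 'café'; only the number of code points differs.
Not every combining sequence has a precomposed form, so NFC does not
always shrink a string to one code point per glyph.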

Read this article for some arguments on the subject, including a
number of references to Twitter itself:

http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

ChrisA


