Could you verify this, Oh Great Unicode Experts of the Python-List?
Chris Angelico
rosuav at gmail.com
Sun Aug 11 02:24:23 EDT 2013
On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau <joshua at landau.ws> wrote:
> Given tweet = b"caf\x65\xCC\x81".decode():
>
> >>> tweet
> 'café'
>
> But:
>
> >>> len(tweet)
> 5
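You're now looking at the difference between glyphs and combining
characters. len() counts code points, and your string ends with a plain
"e" followed by U+0301 COMBINING ACUTE ACCENT; the two render as a
single glyph but are still two characters. Twitter counts combining
characters, so when you build one "thing" out of lots of
separately-typed parts, it does count as more characters.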
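To see where the fifth character comes from, here is a rough sketch
using the standard unicodedata module. The describe() helper and the
variable names are made up for illustration, and NFC normalization is
just one way to get a shorter count, not necessarily what any given
service does:

```python
import unicodedata

# Same string as in the question: 'cafe' plus U+0301 COMBINING ACUTE ACCENT.
tweet = b"caf\x65\xCC\x81".decode()

def describe(text):
    """Hypothetical helper: list (code point, name, combining class) per character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]

# NFC composes 'e' + U+0301 into the single code point U+00E9.
composed = unicodedata.normalize("NFC", tweet)

# Count only characters with combining class 0 (i.e. skip combining marks).
base_count = sum(1 for ch in tweet if unicodedata.combining(ch) == 0)

for row in describe(tweet):
    print(row)
print(len(tweet), len(composed), base_count)
```

The last row printed by describe() is the combining accent, which is
the one len() counts on top of the four letters you see. Skipping
combining marks is only an approximation of "what the reader sees"; it
does not handle every kind of grapheme cluster.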
Read this article for some arguments on the subject, including a
number of references to Twitter itself:
http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
ChrisA