Multibyte Character Support for Python
Lulu of the Lotus-Eaters
mertz at gnosis.cx
Fri May 10 14:21:31 EDT 2002
Skip Montanaro <skip at pobox.com> wrote previously:
|Yes, depending on what you pass to len(). If it's a plain string it
|definitely depends on the encoding:
| >>> u"a"
| u'a'
| >>> u"a".encode("utf-16")
| '\xff\xfea\x00'
| >>> u"a".encode("utf-8")
| 'a'
| >>> len(u"a".encode("utf-16"))
| 4
| >>> len(u"a".encode("utf-8"))
| 1
| >>> len(u"a")
| 1
Skip knows this, but novices might not. UTF-16 encoding is a bit of a
funny case in terms of length. Each string encoded with the generic
"utf-16" codec is prepended with a two-byte "endian" header (the byte
order mark). So while Skip's example might suggest that "a" takes 4
bytes to encode in UTF-16, it really only takes 2 bytes, plus a 2-byte
"overhead" that is paid once per string. Compare:
>>> u"aa".encode("utf-16")
'\xff\xfea\x00a\x00'
>>> len(u"aa".encode("utf-16"))
6
>>> len(u"aaa".encode("utf-16"))
8
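The same arithmetic can be sketched as a small function. This is only an
illustration, and it assumes Python 3 (where a plain string is already
Unicode and encode() returns bytes) rather than the Python 2 sessions
above. The name utf16_sizes is made up for this example. It relies on
the "utf-16-le" codec, which fixes the byte order and therefore writes
no header.

```python
import codecs


def utf16_sizes(text):
    """Return (total, header, payload) byte counts for text in UTF-16."""
    # The generic codec prepends a 2-byte byte order mark.
    with_header = text.encode("utf-16")
    # An explicit byte order needs no mark, so this is payload only.
    payload = text.encode("utf-16-le")
    header_len = len(with_header) - len(payload)
    return len(with_header), header_len, len(payload)


for s in ("a", "aa", "aaa"):
    print(s, utf16_sizes(s))

# The header is one of the two marks defined in the codecs module;
# which one you get depends on the machine's native byte order.
print("a".encode("utf-16")[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

For these ASCII-only strings the header stays at 2 bytes no matter how
long the string is, and the payload grows by 2 bytes per character.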
Yours, Lulu...
--
mertz@ _/_/_/_/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY:_/_/_/_/ v i
gnosis _/_/ Postmodern Enterprises _/_/ s r
.cx _/_/ MAKERS OF CHAOS.... _/_/ i u
_/_/_/_/_/ LOOK FOR IT IN A NEIGHBORHOOD NEAR YOU_/_/_/_/_/ g s