Multibyte Character Support for Python

Fri May 10 14:21:31 EDT 2002

Skip Montanaro <skip at pobox.com> wrote previously:
|Yes, depending on what you pass to len().  If it's a plain string it
|definitely depends on the encoding:
|    >>> u"a"
|    u'a'
|    >>> u"a".encode("utf-16")
|    '\xff\xfea\x00'
|    >>> u"a".encode("utf-8")
|    'a'
|    >>> len(u"a".encode("utf-16"))
|    4
|    >>> len(u"a".encode("utf-8"))
|    1
|    >>> len(u"a")
|    1

Skip knows this, but novices might not.  UTF-16 encoding is kind of a
funny case in terms of length.  Python's "utf-16" codec prepends a
two-byte byte order mark (the "endian" header) to each encoded string.
So while Skip's example might suggest that "a" takes 4 bytes to encode
in UTF-16, it really only takes 2 bytes, plus a one-time 2-byte
"overhead."  Compare:

    >>> u"aa".encode("utf-16")
    '\xff\xfea\x00a\x00'
    >>> len(u"aa".encode("utf-16"))
    6
    >>> len(u"aaa".encode("utf-16"))
    8
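
To make the overhead explicit, here is a rough sketch that separates the
byte order mark from the payload.  It assumes Python 3 (where plain string
literals are already Unicode and encode() returns bytes), and the helper
name utf16_sizes is made up for illustration.  It leans on the "utf-16-le"
codec, which writes no byte order mark, to measure the payload alone:

```python
def utf16_sizes(text):
    """Return (total_bytes, payload_bytes) for text encoded as UTF-16.

    utf16_sizes is a hypothetical helper, not part of any library.
    """
    # The generic "utf-16" codec prepends a 2-byte byte order mark.
    with_bom = text.encode("utf-16")
    # An explicit-endian codec such as "utf-16-le" writes no mark.
    without_bom = text.encode("utf-16-le")
    return len(with_bom), len(without_bom)


for s in ("a", "aa", "aaa"):
    total, payload = utf16_sizes(s)
    # The difference is the fixed cost of the byte order mark.
    print(s, total, payload, total - payload)
```

For these ASCII-only strings the payload grows by 2 bytes per character
while the difference stays at 2, which matches the 4, 6, 8 progression
above.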

Yours, Lulu...

--
 mertz@   _/_/_/_/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY:_/_/_/_/ v i
gnosis  _/_/                    Postmodern Enterprises         _/_/  s r
.cx    _/_/  MAKERS OF CHAOS....                              _/_/   i u
      _/_/_/_/_/ LOOK FOR IT IN A NEIGHBORHOOD NEAR YOU_/_/_/_/_/    g s