Multibyte Character Support for Python

Martin v. Löwis loewis at informatik.hu-berlin.de
Sat May 11 03:27:01 EDT 2002


huaiyu at gauss.almadan.ibm.com (Huaiyu Zhu) writes:

> >Instead, he proposed that non-ASCII identifiers are represented using
> >UTF-8 encoded byte strings (instead of being represented as Unicode
> >objects); in that case, and for those identifiers, len() would return
> >the number of UTF-8 bytes.
> 
> But would that be different from the number of characters?  

Yes. Watch this:

>>> x=u"\N{EURO SIGN}"
>>> x
u'\u20ac'

This is a single character:

>>> len(x.encode('utf-8'))
3

In UTF-8, this character takes 3 bytes. Note that the UTF-8 encoding
of a Unicode character varies between 1 and 4 bytes.
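
For illustration (this assumes a Python 2.x build with the standard
UTF-8 codec), the byte count grows with the code point:

>>> len(u"x".encode('utf-8'))        # ASCII
1
>>> len(u"\u00e9".encode('utf-8'))   # LATIN SMALL LETTER E WITH ACUTE
2
>>> len(u"\u20ac".encode('utf-8'))   # EURO SIGN
3

(Characters outside the Basic Multilingual Plane take four bytes.)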

> My confusion comes from his assertion that Python itself does not need to
> care whether it's a raw string or unicode.   Is there any need for the
> interpreter to split an identifier into a sequence of characters?  If the
> answer is no, then I guess my question is moot.

The interpreter never does that, but still, a single identifier would
be either an ASCII byte string (the stress being on ASCII) or a
Unicode object:

>>> "x" == u"x"
1
>>> hash("x")
-1819822983
>>> hash(u"x")
-1819822983
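
Since namespaces are ordinary dictionaries, equal values and equal
hashes mean a lookup succeeds no matter which representation is used
as the key; a minimal sketch (assuming a Python 2.x interpreter):

>>> d = {"x": 1}        # key stored as an ASCII byte string
>>> d[u"x"]             # looked up with the equivalent Unicode object
1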

Regards,
Martin
