Multibyte Character Support for Python
Martin v. Löwis
loewis at informatik.hu-berlin.de
Sat May 11 03:27:01 EDT 2002
huaiyu at gauss.almadan.ibm.com (Huaiyu Zhu) writes:
> >Instead, he proposed that non-ASCII identifiers are represented using
> >UTF-8 encoded byte strings (instead of being represented as Unicode
> >objects); in that case, and for those identifiers, len() would return
> >the number of UTF-8 bytes.
>
> But would that be different from the number of characters?
Yes. Watch this:
>>> x=u"\N{EURO SIGN}"
>>> x
u'\u20ac'
This is a single character:
>>> len(x.encode('utf-8'))
3
In UTF-8, this character occupies 3 bytes. Note that the number of
bytes a Unicode character occupies in UTF-8 varies between 1 and 4.
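The varying byte widths can be checked directly. A minimal sketch (written
for modern Python 3, where every str is a Unicode string and no u"" prefix
is needed; the sample characters are my own choices):

```python
# UTF-8 encodes a code point in 1 to 4 bytes, depending on its value.
samples = {
    "A": 1,           # U+0041, ASCII: 1 byte
    "\u00e9": 2,      # U+00E9, LATIN SMALL LETTER E WITH ACUTE: 2 bytes
    "\u20ac": 3,      # U+20AC, EURO SIGN: 3 bytes
    "\U0001f40d": 4,  # U+1F40D, outside the Basic Multilingual Plane: 4 bytes
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected
```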
> My confusion comes from his assertion that Python itself does not need to
> care whether it's raw string or unicode. Is there any need for the
> interpreter to split an identifier into sequence of characters? If the
> answer is no, then I guess my question is moot.
The interpreter never does that, but still, a single identifier would
either be an ASCII byte string (the stress being on ASCII), or a
Unicode object:
>>> "x" == u"x"
1
>>> hash("x")
-1819822983
>>> hash(u"x")
-1819822983
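The point of the equal hashes is that CPython namespaces are dicts, so
identifier lookup depends only on equality and hashing of the key, not on
its concrete string type. A minimal sketch of that property (Python 3 str
throughout; the names and values are illustrative):

```python
# Namespace lookup only requires that two keys compare equal and hash
# equally; they need not be the same object.
namespace = {}
key_a = "euro_\u20ac"
key_b = "".join(["euro_", "\u20ac"])  # equal value, built as a separate object
assert key_a == key_b and hash(key_a) == hash(key_b)
namespace[key_a] = 42
assert namespace[key_b] == 42  # lookup succeeds with either key
```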
Regards,
Martin