Multibyte Character Support for Python

Martin v. Loewis martin at v.loewis.de
Fri May 10 02:34:39 EDT 2002


huaiyu at gauss.almadan.ibm.com (Huaiyu Zhu) writes:

> Out of curiosity: If a character is two bytes, what would len() report?  If
> s is a unicode string with wide characters, would list(s) be made of
> characters or bytes?  

Python already supports "wide characters", by means of the unicode
type. For this type, len() reports the number of characters, not the
number of bytes used for internal storage.
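
As a minimal sketch of that behaviour, assuming a Python 3 interpreter (where the built-in str type plays the role of the unicode type discussed here) and an arbitrary two-character sample string chosen only for illustration:

```python
# Assumption: Python 3, where str is the Unicode string type.
# The sample text is illustrative: two CJK characters, written as escapes.
s = "\u4e2d\u6587"

# len() counts characters, not the bytes used for internal storage.
char_count = len(s)

# list() splits the string into one-character strings, not bytes.
items = list(s)

print(char_count)
print(items)
```

Iterating over such a string likewise yields characters, so list(s) is made of characters rather than bytes.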

> Would that be different under the current situation, or the PEP 263,
> or under Stephen's proposal?  Would it change depending on how the
> unicode is encoded?

For the Unicode type, nothing would change - Stephen did not propose
to change the Unicode type.

Instead, he proposed that non-ASCII identifiers are represented using
UTF-8 encoded byte strings (instead of being represented as Unicode
objects); in that case, and for those identifiers, len() would return
the number of UTF-8 bytes.
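
The difference can be sketched as follows, again assuming Python 3 (str for Unicode, bytes for byte strings). The identifier is hypothetical and only stands in for a non-ASCII name; this is not code from the proposal itself.

```python
# Assumption: Python 3; the identifier below is a made-up example.
name = "gr\u00f6\u00dfe"  # five characters, two of them non-ASCII

# The same name as a UTF-8 encoded byte string.
encoded = name.encode("utf-8")

unicode_len = len(name)   # counts characters
utf8_len = len(encoded)   # counts UTF-8 bytes

print(unicode_len, utf8_len)
```

The two non-ASCII characters each occupy two bytes in UTF-8, so the byte-string length is larger than the character count.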

> A list of such simple questions and answers for various proposals
> would help many more people to understand the relevant PEPs.

I recommend that you first familiarize yourself with the Unicode
support that was introduced in Python 2.0.

Regards,
Martin


