PEP 263 status check
"Martin v. Löwis"
martin at v.loewis.de
Fri Aug 6 08:20:41 EDT 2004
John Roth wrote:
> Or are you trying to say that the character string will
> contain the UTF-8 encoding of these characters; that
> is, if I do a subscript, I will get one character of the
> multi-byte encoding?
Michael is almost right: this is what happens, except that
I wouldn't call what you get a "character". It is always a
single byte, even if that byte is part of a multi-byte
character.
Unfortunately, the things that constitute a byte string
are also called characters in the literature.
To be more specific: in a UTF-8 source file, doing
    print "ö" == "\xc3\xb6"
    print "ö"[0] == "\xc3"
would print "True" twice, and len("ö") is 2.
OTOH, len(u"ö") == 1.
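For anyone trying this on Python 3, the same byte/character distinction can be sketched with explicit str and bytes objects. This is only an illustration and assumes a Python 3 interpreter, which is not what the lines above describe; the names text and data are made up for the example, and escapes are used so that the result does not depend on the encoding of the source file.

```python
# Python 3 sketch (assumption: not the Python 2 interpreter discussed above).
# "\u00f6" is the single character "ö"; the escape avoids any dependence
# on how this source file itself is encoded.
text = "\u00f6"                # Unicode string holding one character
data = text.encode("utf-8")    # byte string holding its UTF-8 encoding

print(data == b"\xc3\xb6")     # True: "ö" occupies two bytes in UTF-8
print(data[0:1] == b"\xc3")    # True: a slice yields the first byte only
print(len(data))               # 2 (bytes)
print(len(text))               # 1 (character)
```

Note that in Python 3 a plain subscript such as data[0] returns an integer (0xC3 here) rather than a one-byte string, which is why the sketch uses a slice.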
> The point of this is that I don't think that either behavior
> is what one would expect. It's also an open invitation
> for someone to make an unchecked mistake! I think this
> may be Hallvard's underlying issue in the other thread.
What would you expect instead? Do you think your expectation
is implementable?
Regards,
Martin