[XML-SIG] Re: Issues with Unicode type

Martin v. Loewis martin@v.loewis.de
24 Sep 2002 01:06:10 +0200


Uche Ogbuji <uche.ogbuji@fourthought.com> writes:

> I think the real problem is rather that nothing says that len()
> operating on Unicode objects is *not* a count of characters.  There is
> nothing that says that len is strictly a count of storage values.  I
> think it's perfectly natural to assume len() is a count of characters,
> and Python's docs should be clarified in this regard.  

I somewhat disagree. I think this is the first time in over a year
that anybody has noticed. By the time somebody notices again, we might
all be using UCS-4 builds, and the problem will be gone.
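
To make the distinction concrete, here is a minimal sketch. It assumes a
current Python 3 interpreter, where len() on a str counts code points, and
it counts 16-bit storage units by hand to show the figure that a narrow
(UCS-2) build reports. The helper name utf16_code_units is made up for
illustration.

```python
def utf16_code_units(text):
    """Return how many 16-bit storage units text needs in UTF-16."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no byte-order mark.
    return len(text.encode("utf-16-le")) // 2


# "a" is in the Basic Multilingual Plane; U+10000 is the first character
# outside it and needs a surrogate pair in UTF-16.
sample = "a\U00010000"

characters = len(sample)                  # code points
storage_units = utf16_code_units(sample)  # 16-bit units, as on a narrow build

print(characters, storage_units)
```

The two numbers differ only for strings containing characters beyond
U+FFFF, which is why so few people run into it.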

> Consider that other built-ins such as repr and the literal parsing
> code do deal in characters and not storage values.  So why should
> anyone expect len() to be different?

Actually, up to Python 2.3, literal parsing operates on bytes, not
characters. If you have a non-ASCII encoding in your sources, the
escape backslash would escape only the next byte - which may or may
not be the next character.
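
As a rough illustration of "next byte" versus "next character" (not the
actual tokenizer code), the sketch below assumes UTF-8 as the source
encoding and uses made-up variable names:

```python
# Source bytes as a byte-oriented parser sees them: a backslash followed by
# U+00E9, which occupies two bytes in UTF-8 (the encoding is an assumption).
source = "\\\u00e9".encode("utf-8")

# What follows the backslash if you step by one byte...
after_backslash_byte = source[1:2]

# ...and what follows it if you decode first and step by one character.
after_backslash_char = source[1:].decode("utf-8")[0]

print(repr(after_backslash_byte), repr(after_backslash_char))
```

Stepping by bytes yields only the first half of the character.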

Again, few people ever notice.

> As I said the main problem I see with all this in Python is
> inconsistency and lack of docs.

You are just not reading all the docs. There is a PEP that spells out
all these details, deeper than you ever wanted to know.

Regards,
Martin