[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Sat Aug 13 22:26:21 CEST 2011

Antoine Pitrou <pitrou at free.fr> added the comment:

> Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds.
> Perhaps someone could tell me why the Python documentation says it uses
> UCS-2 on a narrow build.

There's a disagreement on that point between several developers. See an example sub-thread at:
http://mail.python.org/pipermail/python-dev/2010-November/105751.html
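
For readers unfamiliar with the distinction, here is a small illustration of what is being argued about. It is only a sketch, written for Python 3.3 or later (where the narrow/wide split no longer exists), and the variable names are made up for the example. A character outside the BMP cannot be represented in UCS-2 at all, while UTF-16 stores it as a surrogate pair of two 16-bit code units. A narrow build stored such pairs, yet len() and indexing counted code units rather than code points.

```python
# Sketch only: variable names are illustrative, not taken from CPython.
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point outside the BMP

# Encode as big-endian UTF-16 (no BOM) and split into 16-bit code units.
raw = ch.encode("utf-16-be")
code_units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

# Compute the same surrogate pair by hand from the code point value.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)    # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate

print([hex(u) for u in code_units])  # two code units for one code point
print(len(ch))                       # 1 here; a narrow build reported 2
```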

> Since you are already using a variable-width encoding, why the
> supercilious attitude toward UTF-8?

I think you are reading too much into these decisions. It's simply that no one took the time to write an alternative implementation and demonstrate its superiority. I also believe the original implementation was UCS-2 and that surrogate support was added progressively over the years. Hence the terminological mess and the ad-hoc semantics.

I agree that going with UTF-8 and a clever indexing scheme would be a better solution.
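
To make "a clever indexing scheme" a bit more concrete, here is one possible shape for it, purely as a sketch: nobody has proposed this exact design in the issue, and the class name IndexedUTF8 and the stride parameter are hypothetical. The idea is to keep the text as UTF-8 bytes and record the byte offset of every stride-th code point, so that indexing by code point costs one table lookup plus a scan over at most stride - 1 characters.

```python
class IndexedUTF8:
    """Hypothetical sketch: UTF-8 storage plus a sparse code point index."""

    def __init__(self, text, stride=4):
        self.data = text.encode("utf-8")
        self.stride = stride
        self.offsets = []  # byte offset of every stride-th code point
        count = 0
        for pos, byte in enumerate(self.data):
            # Continuation bytes look like 0b10xxxxxx; anything else
            # starts a new code point.
            if byte & 0xC0 != 0x80:
                if count % stride == 0:
                    self.offsets.append(pos)
                count += 1
        self.length = count

    def __len__(self):
        # Length in code points, not bytes.
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Jump to the nearest indexed code point at or before the target.
        pos = self.offsets[index // self.stride]
        # Walk forward over the remaining code points.
        for _ in range(index % self.stride):
            pos += 1
            while self.data[pos] & 0xC0 == 0x80:
                pos += 1
        # Find the end of the target code point and decode it.
        end = pos + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[pos:end].decode("utf-8")


s = IndexedUTF8("a\U0001F600\u00e9 z", stride=2)
print(len(s), len(s.data))  # code points vs. bytes
print(s[1] == "\U0001F600")
```

The stride trades memory for lookup time: a larger stride means a smaller offset table but a longer scan per index. Slicing, negative indices and mutation are left out of the sketch.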

----------
nosy: +pitrou

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________