[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Sun Aug 14 19:36:42 CEST 2011

Antoine Pitrou <pitrou at free.fr> added the comment:

> > The UTF-8 codec described by RFC 2279 didn't say so, so, since our
> > codec was following RFC 2279, it was producing valid UTF-8.  With RFC
> > 3629 a number of things changed in a non-backward compatible way.
> > Therefore we couldn't just change the behavior of the UTF-8 codec nor
> > rename it to something else in Python 2.  We had to wait till Python 3
> > in order to fix it.
> 
> I'm a bit confused on this.  You no longer fix bugs in Python 2?

In general, we try not to introduce changes that have a high probability
of breaking existing code, especially when what is being "fixed" is a
minor issue which almost nobody complains about.

This is even truer for stable branches, and Python 2 is very much a
stable branch now (no more feature releases after 2.7).

> That's why I say that you are of conformance by having encoders and decoders of UTF
> streams tolerate noncharacters.  You are not allowed to call something a UTF and do
> non-UTF things with it, because this in violation of conformance requirement C2.

Perhaps, but it is not Python's fault if the IETF and the Unicode
consortium have disagreed on what UTF-8 should be. I'm not sure what
people called "UTF-8" when support for it was first introduced in
Python, but you can't blame us for maintaining a consistent behaviour
across releases.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________