[Python-Dev] multibytecodex

Victor Stinner victor.stinner at haypocalc.com
Thu May 26 00:13:42 CEST 2011


On Wednesday 25 May 2011 at 23:41 +0200, Laura Creighton wrote:
> One reason I didn't implement the classes yet is that I couldn't
> understand two points in how they are supposed to work.  But it seems
> that there are really two bugs, as I've been pointed to:
> http://bugs.python.org/issue12100 and
> http://bugs.python.org/issue12171 .  So the question is if we should
> be bug-compatible with Python 2.7 or if we should instead implement
> some fixed version.

I fixed #12100 in Python 2.7, 3.1, 3.2, 3.3 yesterday.

I also plan to fix #12171 in these four versions; it should be done in
the next few days.

> I suppose I'm rather for the fixed version, but I'd like to hear some
> feedback from people that actually use multibytecodecs.

Both bugs are in the encoders. I don't think anyone is using the Python
CJK codecs to encode text (nobody noticed these bugs before); they are
more likely used to decode text.
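
This message does not describe the two bugs in detail, so the following
is only a sketch of one property a stateful incremental encoder is
generally expected to have: feeding the text in chunks should produce
the same bytes as encoding it in one call. The helper name
encode_in_chunks and the sample text are assumptions made for
illustration, not taken from the bug reports.

```python
import codecs


def encode_in_chunks(chunks, encoding):
    """Encode an iterable of str chunks with a single incremental encoder.

    Hypothetical helper for illustration only.
    """
    encoder = codecs.getincrementalencoder(encoding)()
    out = [encoder.encode(chunk) for chunk in chunks]
    # A final call with an empty string lets the encoder flush its state
    # (for ISO 2022 codecs, this is where a trailing escape sequence
    # can be emitted).
    out.append(encoder.encode("", True))
    return b"".join(out)


text = "\u3042\u3044"  # two hiragana characters (sample data)
one_shot = text.encode("iso2022_jp")
chunked = encode_in_chunks(list(text), "iso2022_jp")

# Compare the one-shot and the chunked result.
print(one_shot == chunked)
```

A check like this exercises the encoder state between calls, which a
plain str.encode() call never does.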

Anyway, you should implement a codec without these *bugs*.

For your information, I added more tests to the CJK codecs (e.g. see
#12057), and I plan to add more in the coming weeks. I also plan to fix
issue #12016, yet another CJK codec bug. You may want to wait until all
of these bugs are fixed before working on your own implementation, or
directly implement a version without any of these bugs and then upgrade
the test suite.

> Also, I wouldn't mind if someone would pick up the work and just do it,
> either the classes or ``errors != "strict"'' :-)

Support for error handlers other than strict is far from perfect.
Issue #12016 is the main problem, but there are other minor issues.

In some cases, invalid byte sequences are ignored even with the replace
error handler (whereas I expected U+FFFD characters). CJK codecs are
special because they use escape sequences (especially the ISO 2022
family): what should be done if a byte sequence looks like an escape
sequence but is not valid? Replace each byte with U+FFFD, or ignore
these bytes?
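
To make the question concrete, here is a minimal sketch that decodes the
same bytes with several error handlers and records either the decoded
text or the exception raised. The helper name decode_with_handlers and
the sample byte strings are assumptions chosen for illustration. The
result for the ISO 2022 input is printed rather than assumed, since it
is exactly the behaviour in question and may differ between Python
versions.

```python
def decode_with_handlers(data, encoding,
                         handlers=("strict", "replace", "ignore")):
    """Return {handler: decoded text or the UnicodeDecodeError raised}.

    Hypothetical helper for illustration only.
    """
    results = {}
    for handler in handlers:
        try:
            results[handler] = data.decode(encoding, handler)
        except UnicodeDecodeError as exc:
            results[handler] = exc
    return results


# 0xFF is not a valid byte in EUC-JP, so this input contains one
# invalid byte between two ASCII runs.
for handler, result in decode_with_handlers(b"abc\xffdef", "euc_jp").items():
    print(handler, repr(result))

# Bytes that start like an ISO 2022 escape sequence (ESC $ ...) but use
# a final byte chosen here as a sample of an unsupported designation.
for handler, result in decode_with_handlers(b"\x1b$Zabc", "iso2022_jp").items():
    print(handler, repr(result))
```

Comparing the "replace" and "ignore" rows for the second input shows
whether the codec substitutes U+FFFD for the bogus escape sequence or
silently drops it.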

I'm trying to write tests "describing" the current behaviour, and then I
may try to improve how invalid byte sequences are handled.

Victor


