[I18n-sig] UTF-8 decoder in CVS still buggy

M.-A. Lemburg mal@lemburg.com
Sun, 16 Jul 2000 21:54:54 +0200


Florian Weimer wrote:
> 
> "M.-A. Lemburg" <mal@lemburg.com> writes:
> 
> > > Thanks.  It's more consistent now, but I still don't like it. The
> > > basic question is whether a bad sequence like "c0 80" shall be
> > > replaced by one or multiple U+FFFD characters. I vote for a single
> > > replacement character because it seems natural, but different people
> > > may have different opinions here. ;-)
> >
> > Is there a standard way of dealing with these errors ?
> 
> From Markus Kuhn's test file:
> 
> | According to ISO 10646-1, sections R.7 and 2.3c, a device receiving
> | UTF-8 shall interpret a "malformed sequence in the same way that it
> | interprets a character that is outside the adopted subset". This means
> | usually that the malformed UTF-8 sequence is replaced by a replacement
> | character (U+FFFD), which looks a bit like an inverted question mark,
> | or a similar symbol. It might be a good idea to visually distinguish a
> | malformed UTF-8 sequence from a correctly encoded Unicode character
> | that is just not available in the current font but otherwise fully
> | legal. For both cases, a clearly recognisable symbol should be used.
> | Just ignoring malformed sequences or unavailable characters will make
> | debugging more difficult and can lead to user confusion.
> 
> I've contacted Markus and he told me that the proposed approach (i.e.
> replace the whole sequence with a replacement character) is used in
> the UTF-8 xterm extension for XFree86.  OTOH, the C library interface
> makes this approach a bit complicated to implement, so it's likely
> that each octet in a malformed sequence is replaced by a replacement
> character there.  In the future, if UTF-8-aware C libraries are widely
> deployed, xterm might use them, resulting in a changed behavior, more
> like the current Python one.

Hmm, that would be a +0 for Python's version. Markus seems
to always argue for the "replace with one character" option.
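
To make the two options concrete, here is a rough sketch (not the
decoder in CVS; the names expected_length and decode_lenient are made
up for illustration) that groups a lead byte with the continuation
bytes it announces and then replaces a malformed group either once or
once per octet. It assumes a Python whose strict UTF-8 codec rejects
overlong forms such as "c0 80".

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def expected_length(lead):
    """Sequence length announced by a lead byte, or 0 if it is not one."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    return 0  # stray continuation byte or invalid lead byte


def decode_lenient(data, per_octet):
    """Decode UTF-8 bytes, replacing malformed sequences with U+FFFD.

    per_octet=False: one U+FFFD per malformed sequence.
    per_octet=True:  one U+FFFD per octet of the malformed sequence.
    """
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        # Take the lead byte plus at most n-1 continuation bytes (10xxxxxx).
        j = i + 1
        while j < len(data) and j < i + n and 0x80 <= data[j] < 0xC0:
            j += 1
        chunk = data[i:j]
        try:
            out.append(chunk.decode("utf-8"))  # strict decoding of one group
        except UnicodeDecodeError:
            out.append(REPLACEMENT * (len(chunk) if per_octet else 1))
        i = j
    return "".join(out)


bad = b"a\xc0\x80b"
single = decode_lenient(bad, per_octet=False)   # 'a\ufffdb'
multiple = decode_lenient(bad, per_octet=True)  # 'a\ufffd\ufffdb'
```

With the "c0 80" example from above, the first call yields a single
replacement character between "a" and "b", the second yields two.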

BTW, I found some discussion of the subject:
http://mail.nl.linux.org/linux-utf8/1999-10/msg00106.html
http://mail.nl.linux.org/linux-utf8/1999-09/msg00149.html

> > What do other languages do, e.g. Perl, TCL ?
> 
> Sorry, I don't know.  Anyone else?

Both have native UTF-8 support... can anyone help out on this
one ?
 
> > I don't have any problem changing the current implementation,
> > but would of course like to stick to an accepted standard here.
> 
> There doesn't seem to be any standard yet, and I doubt that there is
> already something like best common practice. :-(

Perhaps we should just wait for somebody with more UTF-8
experience to comment on this.

Whatever strategy is used, it doesn't help the user: she will
have to correct the buggy input one way or another. More
error indicating characters might make the location easier
to find but could also be more annoying.
 
> [Test module]
> 
> > 100 LOCs is ok. Would you be willing to write this up and submit
> > it as patch ?
> 
> It might take some time, but yes, I'm going to do it.

Great :-)
 
> > (What's the copyright on Markus Kuhn's test suite ?)
> 
> I got permission to use it for this task from him.  Is this
> sufficient, or do you need a disclaimer or something like that?

I guess it should be available under the Python license (or a
compatible one)... frankly, I'm not sure what the current
requirements are (Python moved from CNRI to BeOpen).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/