[I18n-sig] UTF-8 decoder in CVS still buggy

Florian Weimer fw@deneb.enyo.de
23 Jul 2000 14:03:36 +0200


"M.-A. Lemburg" <mal@lemburg.com> writes:

> > > What do other languages do, e.g. Perl, TCL ?
> > 
> > Sorry, I don't know.  Anyone else?
> 
> Both have native UTF-8 support... can anyone help out on this
> one ?

Perl's UTF-8 support is still extremely rudimentary.  Even Larry seems
to admit that.  The general Perl philosophy seems to be to preserve
invalid UTF-8 sequences.  (They use UTF-8 as their internal string
representation, which is why they can do this.)  This is not
applicable to Python, I think.

Tcl seems to assume that invalid UTF-8 sequences are ISO-8859-1.  At
least this is what the code seems to do, although its documentation
says that replacement characters are used.  It doesn't handle overlong
sequences properly (contrary to the recommendation in RFC 2279).
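For illustration (using today's Python, not the Tcl code under
discussion): b'\xc0\xaf' is an overlong two-byte encoding of '/'
(U+002F), the classic example from RFC 2279's security discussion.  A
decoder that follows the RFC's recommendation must refuse it, as
Python's current UTF-8 codec does:

```python
# b'\xc0\xaf' is an overlong encoding of U+002F ('/'): the payload
# fits in one byte, so a two-byte form is forbidden by RFC 2279.
overlong = b'\xc0\xaf'
try:
    overlong.decode('utf-8')
    print('accepted')  # a conforming decoder must not reach this line
except UnicodeDecodeError as exc:
    print('rejected:', exc.reason)
```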

In Java, the behavior of the UTF-8 decoder is not specified in the
language definition, which probably means that Java implementations
differ considerably in this area.

> Whatever strategy is used, it doesn't help the user: she will
> have to correct the buggy input one way or another. More
> error indicating characters might make the location easier
> to find but could also be more annoying.

Anyway, I think we can agree that a single replacement character shall
be used in the following cases:

        - a valid UTF-8 sequence which encodes a UCS-4 character not
          representable in UTF-16
        
        - a UTF-8 sequence which is an overlong representation of a
          character, but otherwise correct

For the remaining cases, I would vote for the "one replacement
character per source octet" approach.  After some thinking, this seems
to be the most natural approach to me.  If the UTF-8 stream is
garbled, there's no point in being clever and trying to guess
character boundaries, because this information is very likely
meaningless anyway.
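The policy could be sketched roughly as follows (this is my own
illustration, not Python's actual codec; the function name and
structure are mine): a structurally valid but overlong or
out-of-range sequence yields a single U+FFFD, while garbled input
yields one U+FFFD per source octet.

```python
REPLACEMENT = '\ufffd'

def decode_utf8_lenient(data):
    """Sketch of the proposed error policy, per RFC 2279 sequence forms."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                              # plain ASCII octet
            out.append(chr(b)); i += 1; continue
        # Determine sequence length and the smallest value that
        # legitimately needs that many octets (overlong threshold).
        if   0xC0 <= b <= 0xDF: n, lo = 2, 0x80
        elif 0xE0 <= b <= 0xEF: n, lo = 3, 0x800
        elif 0xF0 <= b <= 0xF7: n, lo = 4, 0x10000
        elif 0xF8 <= b <= 0xFB: n, lo = 5, 0x200000
        elif 0xFC <= b <= 0xFD: n, lo = 6, 0x4000000
        else:
            # Stray continuation octet or 0xFE/0xFF: garbled input,
            # so emit one replacement for this single octet.
            out.append(REPLACEMENT); i += 1; continue
        tail = data[i + 1:i + n]
        if len(tail) != n - 1 or any(not 0x80 <= c <= 0xBF for c in tail):
            # Truncated or broken sequence: replace the lead octet only
            # and rescan from the next octet, so each garbled octet
            # produces exactly one replacement character.
            out.append(REPLACEMENT); i += 1; continue
        cp = b & (0x7F >> n)                      # payload bits of the lead
        for c in tail:
            cp = (cp << 6) | (c & 0x3F)
        if cp < lo or cp > 0x10FFFF:
            # Overlong, or not representable in UTF-16: one replacement
            # for the whole sequence.
            out.append(REPLACEMENT)
        else:
            out.append(chr(cp))
        i += n
    return ''.join(out)
```

For example, the overlong b'\xc0\xaf' becomes a single U+FFFD, while
the two stray continuation octets b'\x80\x80' become two of them.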

As a safety measure, I'd suggest stating that Python's behavior
may change in a later version if the chosen approach proves to be
inadequate.