[I18n-sig] Proposal: Extended error handling for unicode.encode

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Thu, 4 Jan 2001 02:09:23 +0100


> > How does this "Unicode compression example" look like?
> 
> Please see the Unicode.org site for a description of the
> Unicode compression algorithm. 

Specifically, http://www.unicode.org/unicode/reports/tr6/

> Other encoders will likely have similar problems, e.g. ones which
> compress data based on locality assumptions.

Of course, the TR 6 mechanism won't have the problem at all that we
are talking about - in section 5, it says

# The compression scheme is capable of compressing strings containing
# any Unicode character.

so the callback for unencodable characters would never be called.

Even if it *had* to preserve state (e.g. when encoding into ISO-2022),
Walter's proposal is that the callback returns a Unicode object that
is encoded *instead* of the original character. I have yet to see an
encoding scheme that would fail under this scheme: in the ISO-2022
case, with XML character entities, the codec would know what state it
is in, so it would know whether it has to switch to single-byte mode
to encode the &#<number>; or not.
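
For illustration only: later Python versions expose a callback of this
shape through codecs.register_error. The sketch below assumes that API
and the standard iso2022_jp codec; the handler name "sketch-xmlcharref"
is made up for this example. The handler returns a Unicode string, and
the codec, not the handler, takes care of any mode switch needed to
emit it.

```python
import codecs


def xml_charref(exc):
    """Replace each unencodable character with an XML character reference."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end


# "sketch-xmlcharref" is a hypothetical name chosen for this example.
codecs.register_error("sketch-xmlcharref", xml_charref)

# U+20AC (euro sign) is not encodable in ISO-2022-JP and sits between
# two kanji, so the codec is in double-byte mode when the error occurs.
text = "\u65e5\u672c\u20ac\u8a9e"
encoded = text.encode("iso2022_jp", "sketch-xmlcharref")
decoded = encoded.decode("iso2022_jp")
print(decoded)
```

Decoding the result gives back the kanji with the euro sign replaced by
its character reference, which is only possible if the codec left
double-byte mode before writing the ASCII replacement.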

Looking again at the TR 6 mechanism: even if the error callback were
called, and even if it had to return bytes instead of Unicode objects,
it could still operate statelessly: it would just output SQU as often
as required. I believe that most stateful encodings have an "escape to
known state" mechanism.
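
A minimal sketch of what such a stateless, bytes-returning fallback
might look like for TR 6. It assumes the encoder is in single-byte mode
at the point of the call, where the SQU tag is the byte 0x0E followed
by one UTF-16 code unit, high byte first. The function name squ_quote
is hypothetical; Python ships no TR 6 codec.

```python
SQU = 0x0E  # TR 6 "Quote Unicode" tag, valid in single-byte mode


def squ_quote(text):
    """Quote every UTF-16 code unit of `text` with its own SQU tag.

    No window or mode state is consulted or changed, so the output can
    be spliced into the stream wherever single-byte mode is active.
    """
    out = bytearray()
    units = text.encode("utf-16-be")
    for i in range(0, len(units), 2):
        out.append(SQU)
        out += units[i:i + 2]  # high byte, then low byte
    return bytes(out)


quoted = squ_quote("\u20ac")
```

This produces three bytes per code unit, so it is wasteful, but it
needs no knowledge of the encoder's current windows.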

So I still think your objection is theoretical, whereas the problem
that Walter is trying to solve is real.

Regards,
Martin