[I18n-sig] Proposal: Extended error handling for unicode.encode

M.-A. Lemburg mal@lemburg.com
Thu, 04 Jan 2001 11:00:10 +0100


"Martin v. Loewis" wrote:
> 
> > > How does this "Unicode compression example" look like?
> >
> > Please see the Unicode.org site for a description of the
> > Unicode compression algorithm.
> 
> Specifically, http://www.unicode.org/unicode/reports/tr6/
> 
> > Other encoders will likely have similar problems, e.g. ones which
> > compress data based on locality assumptions.
> 
> Of course, the TR 6 mechanism won't have the problem at all that we
> are talking about - in section 5, it says
> 
> # The compression scheme is capable of compressing strings containing
> # any Unicode character.
> 
> so the callback for unencodable characters would never be called.

I just used it as an example of the existence of encoders which need
to preserve state.
 
> Even if it *had* to preserve state (e.g. when encoding into ISO-2022),
> Walter's proposal is that the callback returns a Unicode object that
> is encoded *instead* of the original character. I have yet to see an
> encoding scheme that would fail under this scheme: in the ISO-2022
> case, with XML character entities, the codec would know what state it
> is in, so it would know whether it has to switch to single-byte mode
> to encode the &#<number> or not.

How would such a scheme allow passing back control information
such as "continue with the next character in the stream" or
"break with an exception"?
 
> Looking again at the TR6 mechanism: Even if the error callback was
> called, and even if it had to return bytes instead of unicodes, it
> could still operate stateless: it would just output SQU as often as
> required. I believe that most stateful encodings have a "escape to
> known state" mechanism.

Which is what I've been talking about all along: the codecs know best
what to do, so it is better to extend them than to try to fiddle in
some information using a callback.

I don't object to adding callback support to the codec's
error handlers, but we'll need a new set of APIs to allow
this.
 
> So I still think your objection is theoretical, whereas the problem
> that Walter is trying to solve is real.

I did propose a solution which would satisfy your needs: simply
add a new error treatment 'xml-escape' to the builtin codecs
which then does the needed XML escaping. XML is general enough
to warrant such a step and the required changes are minor.

Another candidate for a new error treatment would be 'unicode-escape'
which then replaces the character in question with '\uXXXX'.
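
To make the two treatments concrete, here is a rough sketch in
present-day Python of the output they would produce. The helper name
encode_with_escape() and the per-character loop are assumptions made
for illustration only; the actual change would live inside the builtin
codecs, and the loop only works for stateless encodings.

```python
def encode_with_escape(text, encoding, errors):
    """Encode text, escaping characters the encoding cannot represent.

    Hypothetical helper: 'xml-escape' and 'unicode-escape' are the
    treatment names proposed in this mail, not existing ones.
    """
    out = []
    for ch in text:
        try:
            # Encoding one character at a time is only valid for
            # stateless encodings such as ASCII or Latin-1.
            out.append(ch.encode(encoding))
        except UnicodeEncodeError:
            code = ord(ch)
            if errors == 'xml-escape':
                # XML numeric character reference, e.g. &#8364;
                replacement = '&#%d;' % code
            elif errors == 'unicode-escape':
                # \uXXXX, or \UXXXXXXXX outside the 16-bit range
                if code > 0xFFFF:
                    replacement = '\\U%08x' % code
                else:
                    replacement = '\\u%04x' % code
            else:
                raise
            out.append(replacement.encode(encoding))
    return b''.join(out)
```

With this sketch, encode_with_escape('a\u20acb', 'latin-1', 'xml-escape')
gives b'a&#8364;b', and the 'unicode-escape' treatment gives b'a\\u20acb'.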

For the general case, I'd rather add new PyUnicode_EncodeEx()
and PyUnicode_DecodeEx() APIs which then take a Python
context object as extra argument. The error treatment string
would then define how to use this context object, e.g. 'callback'
could be made to apply processing similar to what Walter
suggested.

The xxxEx() APIs will have to take special precautions to also
work with pre-2.1 codecs though, since the codec API definition
does not include the extra context object.
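
A Python-level sketch of how such an Ex-style entry point might behave,
written in present-day Python. Everything here is an assumption for
illustration: encode_ex() and xml_escape_context() are hypothetical
names, the real proposal is a C API, and treating the context object as
a callable is just one possible reading of the 'callback' treatment.

```python
import codecs


def encode_ex(text, encoding, errors='strict', context=None):
    """Hypothetical analogue of the proposed PyUnicode_EncodeEx()."""
    if errors != 'callback':
        # Existing treatments never see the context object, so codecs
        # that know nothing about it keep working unchanged.
        return text.encode(encoding, errors)
    if not callable(context):
        raise TypeError("'callback' needs a callable context object")

    encoder = codecs.getencoder(encoding)
    out = []
    pos = 0
    while pos < len(text):
        chunk = text[pos:]
        try:
            data, _ = encoder(chunk, 'strict')
            out.append(data)
            break
        except UnicodeEncodeError as exc:
            # exc.start and exc.end index into chunk, not into text.
            if exc.start:
                good, _ = encoder(chunk[:exc.start], 'strict')
                out.append(good)
            # The context decides what to emit; it may also raise an
            # exception to abort the whole encoding run.
            replacement = context(chunk[exc.start:exc.end])
            data, _ = encoder(replacement, 'strict')
            out.append(data)
            pos += exc.end
    # Chunks are encoded independently, so this sketch is only
    # suitable for stateless encodings.
    return b''.join(out)


def xml_escape_context(bad):
    """Example context: turn each character into an XML reference."""
    return ''.join('&#%d;' % ord(ch) for ch in bad)
```

For example, encode_ex('a\u20acb', 'ascii', 'callback', xml_escape_context)
gives b'a&#8364;b'. A context that raises instead of returning a
replacement stops the encoding with that exception.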

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/