[I18n-sig] Proposal: Extended error handling for unicode.encode

"Walter Dörwald" walter@livinglogic.de
Mon, 08 Jan 2001 19:59:43 +0100


On 04.01.01 at 11:00 M.-A. Lemburg wrote:

> [...]
>  
> > Even if it *had* to preserve state (e.g. when encoding into ISO-2022),
> > Walter's proposal is that the callback returns a Unicode object that
> > is encoded *instead* of the original character. I have yet to see an
> > encoding scheme that would fail under this scheme: in the ISO-2022
> > case, with XML character entities, the codec would know what state it
> > is in, so it would know whether it has to switch to single-byte mode
> > to encode the &#<number> or not.
> 
> How would such a scheme allow passing back control information
> such as: continue with the next character in the stream

def ignore(encoding, string, position):
	return u""

u"xxx".encode(encoding, 'callback', ignore)

> or break with an exception ?

def raiseAnException(encoding, string, position):
	raise FancyException("can't encode character %r at position %d in string %r with encoding %s" 
		% (string[position], position, string, encoding))

u"xxx".encode(encoding, 'callback', raiseAnException)
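
A handler with the same three-argument signature could also produce the XML character references discussed earlier in the thread. This is only a minimal sketch, written in present-day Python 3 so that it runs on its own; the name xml_char_ref is hypothetical, and the 'callback' argument to encode() remains a proposal rather than an existing API, so the handler is called directly here.

```python
def xml_char_ref(encoding, string, position):
    # Hypothetical handler: return a replacement string that the codec
    # would encode *instead* of the unencodable character.
    # The encoding argument is unused here but kept for the signature.
    return "&#%d;" % ord(string[position])


# Direct call, standing in for what a codec would do on an error:
# position 1 holds U+00E4, which ASCII cannot represent.
print(xml_char_ref("ascii", "a\u00e4b", 1))
```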

> > Looking again at the TR6 mechanism: Even if the error callback was
> > called, and even if it had to return bytes instead of unicodes, it
> > could still operate stateless: it would just output SQU as often as
> > required. I believe that most stateful encodings have an "escape to
> > known state" mechanism.
> 
> Which is what I'm talking about all along: the codecs know best
> what to do, so better extend them than try to fiddle in some
> information using a callback.

The callback is only used in the situation when the codec does
not know what to do, i.e. when it encounters an unencodable
character. The callback is an *error handler* and not a
"I don't know how to implement my own encoding algorithm,
please help me"-handler. >;->
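
To make the division of labour concrete, here is a rough sketch of how an encoder could consult the handler only for characters it cannot encode. It is an emulation in present-day Python 3, not the proposed implementation: encode_with_callback is a hypothetical helper, it encodes character by character, and it therefore assumes a stateless codec such as ASCII or Latin-1.

```python
def encode_with_callback(text, encoding, handler):
    """Encode text, calling handler only for unencodable characters."""
    out = []
    for position, char in enumerate(text):
        try:
            # Normal case: the codec knows what to do, no callback involved.
            out.append(char.encode(encoding))
        except UnicodeEncodeError:
            # Error case: ask the handler for a replacement string and
            # encode that instead of the original character.
            replacement = handler(encoding, text, position)
            # If the replacement is itself unencodable, this raises.
            out.append(replacement.encode(encoding, "strict"))
    return b"".join(out)


def ignore(encoding, string, position):
    # Skip the character and continue with the next one.
    return ""


def raise_an_exception(encoding, string, position):
    # Abort the whole encoding run.
    raise ValueError("can't encode character %r at position %d"
                     % (string[position], position))
```

With these definitions, encode_with_callback(u"a\xe4b", "ascii", ignore) drops the middle character, while the same call with "latin-1" never invokes the handler at all.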

> I don't object to adding callback support to the codec's
> error handlers, but we'll need a new set of APIs to allow
> this.

I could live with a
	u"xxx".encode(encoding, 'callback', handler)
on the Python side, but what does this mean for the C API?

> > So I still think your objection is theoretical, whereas the problem
> > that Walter is trying to solve is real.
> 
> I did propose a solution which would satisfy your needs: simply
> add a new error treatment 'xml-escape' to the builtin codecs
> which then does the needed XML escaping. XML is general enough
> to warrant such a step and the required changes are minor.
> 
> Another candidate for a new error treatment would be 'unicode-escape'
> which then replaces the character in question with '\uXXXX'.
> 
> For the general case, I'd rather add new PyUnicode_EncodeEx()
> and PyUnicode_DecodeEx() APIs which then take a Python
> context object as extra argument. 

What should this extra argument be for the decoder?

> The error treatment string
> would then define how to use this context object, e.g. 'callback'
> could be made to apply processing similar to what Walter
> suggested.

'callback' seems too generic to me. Maybe there will be other callbacks
in the future that are used for different things. This is the
"give me a replacement or die" error handler.

> The xxxEx() APIs will have to take special precautions to also
> work with pre-2.1 codecs though, since the codec API definition
> does not include the extra context object.


Bye,
   Walter Dörwald

-- 
Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de