[I18n-sig] Proposal: Extended error handling for unicode.encode

"Walter Dörwald" walter@amazonas.livinglogic.de
Fri, 22 Dec 2000 16:32:31 +0100


On 21.12.00 at 18:30 M.-A. Lemburg wrote:

> "Martin v. Loewis" wrote:
> > 
> > > The problem with this is that the error handler will usually
> > > have to have access to the internal data structure of the codec
> > > to be able to process the error, e.g. <char> in your example
> > > could be a single character, a UTF-16 sequence, etc.
> > 
> > Please note that in his encoding, char is a Unicode string
> > (specifically, character), so it can't be a UTF-16 sequence.
> > What *encoder* that you know needs to have internal state?
> 
> The codec is much more general and kept symmetric for obvious reasons.
> In his case, char would be a Unicode string, but the input to
> an encoder could just as well be an image, a sound or some other
> abstract form of data storage. It is not unlikely that these
> encoders will need to keep state.
> 
> Even for Unicode you will need to keep state in the encoder,
> e.g. to write an encoder which uses the Unicode compression
> algorithm as basis (the output stream contains markers to
> switch pages).

But I don't see how this internal encoder state should influence
what the error handler does. There are two layers involved: The
character encoding layer and the "unencodable character escape
mechanism". Both layers are completely independent, even in your
"Unicode compression" example, where the "unencodable character 
escape mechanism" is XML character entities.
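
The independence of the two layers can be sketched with the
error-handler registry that later Python versions provide
(codecs.register_error). That is not the interface discussed in this
thread, and the handler name "xmlescape" is made up for illustration.
The handler only sees the unencodable characters, never any codec
state, so one function can serve every encoding:

```python
import codecs

def xml_escape(exc):
    # Hypothetical handler: it only deals with encoding errors.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # The handler sees just the offending slice of the input string.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    # Return the replacement text and the position to resume encoding at.
    return replacement, exc.end

codecs.register_error("xmlescape", xml_escape)

text = "a\u20acb\xe4"
# The same handler is used unchanged with two different encodings.
ascii_bytes = text.encode("ascii", "xmlescape")
latin1_bytes = text.encode("latin-1", "xmlescape")
```

With ASCII both non-ASCII characters are escaped; with Latin-1 only the
euro sign is, because the a-umlaut is encodable there.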

> > Anyway, if you think that state should be accessible to the error
> > handling function, it won't be hard to pass state to the callback.
> > E.g. you could pass the string being encoded, the current position,
> > and optionally a Codec instance (many codecs would pass None, as they
> > don't keep any state).
> 
> Hmm, I don't think this is generally useful. Using the codec
> instances directly would be the right way to go, IMHO. I don't
> want to overload .encode() or unicode() with too much functionality.

We're only talking about encoding here. You are right that state might
be required for a decoder.

> Writing your own function helpers which then apply all the necessary
> magic is simple and doesn't warrant changing APIs in the core.

It is not as simple as the error handler, but I could live with that.

The big problem is that it effectively kills the speed of your
application. Every XML application written in Python, no matter
what it does internally, will in the end have to produce an output
bytestring. Normally the output encoding should be one that produces
no unencodable characters, but you have to be prepared to handle
them. With the error handler the complete encoding will be done
in C code (with very infrequent calls to the error handler), so
this scheme gives the best speed possible.
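
To make the speed argument concrete, here is a rough sketch contrasting
the two approaches. It is written against the codecs.register_error
registry of later Python versions rather than the interface proposed
here, and the names encode_with_helper and "xmlref" are hypothetical.
The helper runs Python code for every character, while the handler
version lets the codec run over the whole string and calls back into
Python only for the unencodable characters:

```python
import codecs

def encode_with_helper(text, encoding):
    # Helper-function approach: one codec call per character, all driven
    # from Python. Assumes a stateless, ASCII-compatible target encoding.
    chunks = []
    for ch in text:
        try:
            chunks.append(ch.encode(encoding))
        except UnicodeEncodeError:
            chunks.append(("&#%d;" % ord(ch)).encode("ascii"))
    return b"".join(chunks)

calls = []  # records each time the handler is invoked

def xmlref(exc):
    # Error-handler approach: invoked only for unencodable runs.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    calls.append((exc.start, exc.end))
    bad = exc.object[exc.start:exc.end]
    return "".join("&#%d;" % ord(c) for c in bad), exc.end

codecs.register_error("xmlref", xmlref)

text = "x" * 1000 + "\u20ac" + "y" * 1000
slow = encode_with_helper(text, "latin-1")
fast = text.encode("latin-1", "xmlref")
```

Both calls produce the same bytes, but the handler is entered once for
the single euro sign instead of Python code running 2001 times.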

> Since the error handling is extensible by adding new options
> such as 'callback',

I would prefer a more object oriented way of extending the error 
handling.

> the existing codecs could be extended to
> provide this functionality as well. We'd only need a way to
> pass the callback to the codecs in some way, e.g. by using
> a keyword argument on the constructor or by subclassing it
> and providing a new method for the error handling in question.

There is no need for a string argument 'callback' and
an additional callback function/method that is passed to the
encoder. When the error argument is a string, the old mechanism
can be used; when it is a callable object, the new one will be used.
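
The dispatch on the type of the error argument might look like the
following sketch. The wrapper function encode and its temporary-name
trick are hypothetical, and it is built on the codecs.register_error
registry of later Python versions instead of a change to the codecs
themselves. The callback is assumed to take the UnicodeEncodeError and
to return a replacement string plus the position to resume at:

```python
import codecs

def encode(text, encoding, errors="strict"):
    # A string selects one of the existing named error schemes.
    if isinstance(errors, str):
        return text.encode(encoding, errors)
    # A callable is used as the error handler itself; it is registered
    # under a throwaway name because str.encode only accepts names.
    if callable(errors):
        name = "callback-%d" % id(errors)
        codecs.register_error(name, errors)
        return text.encode(encoding, name)
    raise TypeError("errors must be a string or a callable")

def question_mark(exc):
    # Illustrative callback: one "?" per unencodable character.
    return "?" * (exc.end - exc.start), exc.end

old_style = encode("gr\xfc\xdf", "ascii", "replace")
new_style = encode("gr\xfc\xdf", "ascii", question_mark)
```

Callers that pass 'strict', 'ignore' or 'replace' keep working
unchanged, and no extra 'callback' option string is needed.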

> [...]

Bye,
   Walter Dörwald

-- 
Walter Dörwald · LivingLogic AG · Bayreuth, Germany ·
www.livinglogic.de