[I18n-sig] Proposal: Extended error handling for unicode.encode

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Fri, 5 Jan 2001 10:08:09 +0100


> Sure, exceptions are not much of a problem, but how would the
> callback tell the encoder/decoder to e.g. skip forward 2 bytes or
> perhaps backward 10 bytes ?

First, I'd like to point out that encoding and decoding are *not*
symmetric with regard to error handling, so there is *no* need to
make the interfaces appear symmetric; it is rather unfortunate that
Python 2 gives this impression.

The reason for the difference is that, for virtually all encodings,
converting to Unicode never fails because a character is missing
from Unicode. Unicode is supposed to support almost everything, and
code sets that cannot be mapped completely into Unicode probably need
special attention anyway (normally, by producing a non-reversible
mapping). So the callback is not needed at all for decoding.

For encoding, my claim is that error callbacks never want to skip
forward 2 bytes. If anything, they would skip forward two characters,
but I can't even imagine a scenario where that would be needed. Don't
try to design an API that nobody will ever use.

Walter has demonstrated how to implement the "skip the current
character" case: by returning u"" from the callback.
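
A minimal sketch of that idea follows. The helper
encode_with_callback and the callback signature (text, pos) are
hypothetical names made up for this illustration, not the interface
from Walter's patch, and encoding one character at a time is only
reasonable for stateless encodings such as ASCII or Latin-1.

```python
def encode_with_callback(text, encoding, callback):
    """Hypothetical helper: encode text, asking callback for a
    replacement whenever a character cannot be encoded."""
    out = []
    for pos, char in enumerate(text):
        try:
            out.append(char.encode(encoding))
        except UnicodeEncodeError:
            # The callback sees the input and the failing position and
            # returns a replacement Unicode string, which is encoded in
            # place of the offending character.
            replacement = callback(text, pos)
            out.append(replacement.encode(encoding))
    return b"".join(out)


def skip(text, pos):
    # Returning the empty string drops the unencodable character.
    return u""


def charref(text, pos):
    # Replace the character with a numeric character reference.
    return u"&#%d;" % ord(text[pos])


skipped = encode_with_callback(u"a\u20acb", "ascii", skip)
referenced = encode_with_callback(u"a\u20acb", "ascii", charref)
```

Note that the callback never has to talk about byte offsets: it
returns characters, and the encoder simply continues with the next
input character.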

> What if the callback would have to scan the stream from the
> beginning to find out where to continue or look ahead a few hundred
> bytes to find the next valid encodable sequence ?

What would be the specific encoding, and what would be the specific
error handling algorithm that would require such a service?

> Again, you should keep in mind that the scheme has to work
> for all encoding/decoding work, not only conversion from and
> to Unicode.

Why is that? That sounds like gross overgeneralization to me.
Specifically, do you know anybody using that framework for anything
but Unicode conversion? If so, who is that, and what is the specific
application?

> If we were to provide a callback as optional method to 
> StreamReaders/Writers, the task could be done either statically
> by subclassing the existing codec StreamReaders/Writers or
> dynamically by asking the codec registry to return the StreamReader/
> Writer classes.

So how would the implementation of charmap_encode invoke this method?
It currently doesn't even get hold of the codec object.

> Another option would be 'copy' which tries to simply copy input
> to output in case this is reasonably possible given the encoding
> (e.g. Unicode -> 8-bit encoding would copy all 8-bit Unicode chars as
> is in case no mapping is defined). An option 'raise' could also
> be valuable in conjunction with an exception context object to have
> the codec raise customized exceptions. Provided the context
> object points to another encoder/decoder, an option 'fallback'
> could be used to tell the codec to pass the failing input data
> to the alternate encoder/decoder in order to have it converted.
> Etc. etc. 
> 
> There are many things one could do with the error string.

I guess my question is different: Do you consider the error string to
be one of a well-defined, finite, enumerated set of possible values,
or is it your view that it is up to the codec what error strings to
accept? If the latter, why would they have to be strings?

Regards,
Martin