[I18n-sig] Proposal: Extended error handling for unicode.encode

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Fri, 22 Dec 2000 14:57:20 +0100


> Hmm, I don't think this is generally useful. Using the codec
> instances directly would be the right way to go, IMHO. I don't
> want to overload .encode() or unicode() with too much functionality.
> Writing your own function helpers which then apply all the necessary
> magic is simple and doesn't warrant changing APIs in the core.

Ok, then I have a challenge for you. Write a codec family that emits
XML character entities on encoding errors for any of the standard
Python codecs. If it's really simple, then I'd *really* appreciate
concrete, working code. I really mean that - I doubt that this is
simple. If a problem arises doing it for all of the encodings, just
pick one. If that is still asking too much, outline a solution;
preferably one that is as efficient as the solution involving
the callback would be.
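
To make the challenge concrete, here is a minimal sketch of the
"helper function" approach in present-day Python 3. The name
encode_with_charrefs is hypothetical and not part of any library. The
sketch assumes a stateless, ASCII-compatible target encoding; it
encodes one character at a time, so it is slow and would not be
correct for stateful codecs, which illustrates the efficiency concern
raised above.

```python
def encode_with_charrefs(text, encoding):
    """Encode text, writing unencodable characters as &#N; references.

    Hypothetical helper: assumes a stateless, ASCII-compatible encoding.
    """
    chunks = []
    for ch in text:
        try:
            # Try the target codec on this single character.
            chunks.append(ch.encode(encoding))
        except UnicodeEncodeError:
            # Fall back to a numeric XML character reference.
            chunks.append(("&#%d;" % ord(ch)).encode("ascii"))
    return b"".join(chunks)


# U+0430 exists in KOI8-R; U+20AC (euro sign) does not.
result = encode_with_charrefs("\u0430\u20ac", "koi8-r")
```

Every character costs a separate codec call here, whereas a callback
invoked by the codec would only be reached for the characters that
actually fail.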

> Since the error handling is extensible by adding new options such as
> 'callback', the existing codecs could be extended to provide this
> functionality as well. We'd only need a way to pass the callback to
> the codecs in some way, e.g. by using a keyword argument on the
> constructor or by subclassing it and providing a new method for the
> error handling in question.

That solution is quite similar to the callback approach, so we could
probably choose either. I'm not entirely sure what the usage scenario
would look like. Did you think that users, instead of writing

  u.encode("koi8-r",errors=xmlcharentities)

would write

  I,forgot,which,parameter = codecs.lookup("koi8-r")
  encode = I()
  encode.install_error_cb(xmlcharentities)
  encode.encode(u,errors="callback")

or did you have a more convenient API in mind?

Also, how would I write the callback function for the koi8-r codec?
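
For illustration only, one shape such a callback could take is
sketched below, using codecs.register_error as provided by later
Python releases. That is not the API under discussion in this thread,
and the handler name xmlcharentities is simply borrowed from the
example above. The handler is codec-independent: it only sees the
exception describing the failed span and returns replacement text plus
the position at which encoding should resume.

```python
import codecs


def xmlcharentities(exc):
    """Error handler: replace unencodable characters with &#N; references."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only handles encoding errors.
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    # Return the replacement text and the index where encoding resumes.
    return replacement, exc.end


# Register the handler under a name usable as the `errors` argument.
codecs.register_error("xmlcharentities", xmlcharentities)

# U+0430 exists in KOI8-R; U+20AC (euro sign) does not.
encoded = "\u0430\u20ac".encode("koi8-r", errors="xmlcharentities")
```

Because the codec only calls the handler when it hits a character it
cannot encode, the same function works unchanged for koi8-r or any
other codec that honours registered error handlers.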

> I meant that it knows better about the current state and
> parameters of the encoding and input it is working on. The ideal
> error handling scheme would call a method on the codec which
> you could then override to provide your own handling, e.g.
> XML entity encoding.

Well, the proposed scheme *is* ideal, in that sense.

> Sure, but the more general solution needs to be well designed.
> The above trick only adds additional information to the error
> instance -- this is easy to implement and doesn't break anything.

Again, I'd like to see how the API is used - ease of implementation of
the API is not my primary concern; I'd be willing to contribute
involved implementations if they make the users' lives easier.

> Note: simply changing the error parameter to a PyObject doesn't 
> work, since all C APIs expect a simple const char.

Sure. Looking from the Python core side of things, it's a large
change. Looking from the users' point of view, it's a small one.

Regards,
Martin