[I18n-sig] Proposal: Extended error handling for unicode.encode

M.-A. Lemburg mal@lemburg.com
Wed, 03 Jan 2001 21:17:59 +0100


"Walter Dörwald" wrote:
> 
> On 22.12.00 at 19:15 M.-A. Lemburg wrote:
> 
> > "Walter Dörwald" wrote:
> > >
> > > On 21.12.00 at 18:30 M.-A. Lemburg wrote:
> > > > [about state in encoders and error handlers]
> > > But I don't see how this internal encoder state should influence
> > > what the error handler does. There are two layers involved: The
> > > character encoding layer and the "unencodable character escape
> > > mechanism". Both layers are completely independent, even in your
> > > "Unicode compression" example, where the "unencodable character
> > > escape mechanism" is XML character entities.
> >
> > This is true for your XML entity escape example, but error
> > resolving in general will likely need to know about the
> > current state of the encoder, e.g. to be able to write data
> > to the corresponding page in the Unicode compression example
> > or to force a switch of the current page to a different one.
> 
> What does this "Unicode compression example" look like?

Please see the Unicode.org site for a description of the
Unicode compression algorithm. Other encoders will likely
have similar problems, e.g. ones which compress data based
on locality assumptions.

> > I know that error handling could be more generic, but passing
> > a callable object instead of the error parameter is not an
> > option since the internal APIs all use a const char parameter
> > for error.
> 
> Changing this can be done in one or two hours by someone
> who knows the Python internals. (Unfortunately I don't; I first
> looked at unicodeobject.[hc] only a few days ago!)

Sure, but it would break code and alter the Python C API
in unacceptable ways. Note that all builtin C codecs would
also have to be changed.

If we are going to extend the error handling mechanism, then
we'd better do it in some backward compatible way, e.g. by
providing new APIs.

> > Besides, I consider such an approach a hack and not
> > a solution.
> >
> > Instead of trying to tweak the implementation into providing
> > some kind of new error scheme, let's focus on finding a generic
> > framework which could provide a solution for the general case
> > while not breaking the existing applications.
> 
> Are the existing codecs (JapaneseCodecs etc.) to be considered part
> of the existing applications?

All code out there which uses the existing codecs and APIs
must be considered when thinking about altering published
Python C APIs.

> The problem might be how to handle callbacks to C functions and
> callbacks to Python functions in a consistent way. I.e. is it
> extern DL_IMPORT(PyObject*) PyUnicode_Encode(
>      const Py_UNICODE *s,        /* Unicode char buffer */
>      int size,                   /* number of Py_UNICODE chars to encode */
>      const char *encoding,       /* encoding */
>      PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */
>      );
> or
> extern DL_IMPORT(PyObject*) PyUnicode_Encode(
>      const Py_UNICODE *s,        /* Unicode char buffer */
>      int size,                   /* number of Py_UNICODE chars to encode */
>      const char *encoding,       /* encoding */
>      PyObject *errorHandler /* error handling via Python function */
>      );

The latter would be the "right" solution.
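To make the contract of such a Python-level error handler concrete, here is a sketch in Python rather than C. All names are hypothetical illustrations of the callback idea being discussed, not an existing API; the driver applies the handler one character at a time, which a real C codec would of course avoid.

```python
def entity_handler(string, position):
    # Hypothetical handler contract: given the full unicode string and
    # the position of the unencodable character, return replacement text.
    return "&#%d;" % ord(string[position])

def encode_with_handler(string, encoding, handler):
    # Naive pure-Python driver for the handler contract above: encode
    # character by character, deferring to the handler on failure.
    result = []
    for pos, char in enumerate(string):
        try:
            result.append(char.encode(encoding))
        except UnicodeEncodeError:
            result.append(handler(string, pos).encode(encoding))
    return b"".join(result)

print(encode_with_handler(u"caf\xe9", "ascii", entity_handler))
# b'caf&#233;'
```

The point of the second prototype is exactly this: the handler is an ordinary callable object, so it can be written in Python or wrapped around a C function without changing the encoder's signature.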
 
> > > > Writing your own function helpers which then apply all the necessary
> > > > magic is simple and doesn't warrant changing APIs in the core.
> > >
> > > It is not as simple as the error handler, but I could live with that.
> > >
> > > The big problem is that it effectively kills the speed of your
> > > application. Every XML application written in Python, no matter
> > > what it does internally, will in the end have to produce an output
> > > bytestring. Normally the output encoding should be one that produces
> > > no unencodable characters, but you have to be prepared to handle
> > > them. With the error handler the complete encoding will be done
> > > in C code (with very infrequent calls to the error handler), so
> > > this scheme gives the best speed possible.
> >
> > It would give even better performance if the codec would provide
> > this hook in some way at C level.
> 
> extern DL_IMPORT(PyObject*) PyUnicode_Encode(
>      const Py_UNICODE *s,        /* Unicode char buffer */
>      int size,                   /* number of Py_UNICODE chars to encode */
>      const char *encoding,       /* encoding */
>      PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */
>      );
> would, but that's not the point. When you use an encoding where more
> than 20% of the characters have to be escaped (as XML entities or whatever),
> you're using the wrong encoding.

That's what I was talking about all along... if it's really
only for escaping XML, then a special Latin-1 or ASCII XML escaping
codec would go a long way (without the trouble of using callbacks
and without having to add a new error callback mechanism).

Writing such a codec doesn't take much time, since the
code is already there. Even better: XML escaping could be added
as a new error handling option, e.g. "xml-escape" instead of
"replace".

Since XML escaping is general enough, I do think that adding
such an option to all builtin codecs would be an acceptable
and workable solution.
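As a sketch of how an "xml-escape" option could behave, here is an implementation using the pluggable error-callback registry that modern Python provides (`codecs.register_error`); the handler name "xml-escape" is the hypothetical option from above, not a built-in:

```python
import codecs

def xml_escape_errors(exc):
    # Replace every unencodable character with an XML character
    # reference, then resume encoding after the failed range.
    if isinstance(exc, UnicodeEncodeError):
        refs = "".join("&#%d;" % ord(ch)
                       for ch in exc.object[exc.start:exc.end])
        return refs, exc.end
    raise exc

codecs.register_error("xml-escape", xml_escape_errors)

print("\u20ac 100".encode("ascii", "xml-escape"))
# b'&#8364; 100'
```

Because the option is selected by name, every builtin codec that honors the errors argument picks it up for free, which is what makes it an acceptable and workable solution.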

> > Note that almost all codecs
> > have their own error handlers written in C already.
> >
> > > > Since the error handling is extensible by adding new options
> > > > such as 'callback',
> > >
> > > I would prefer a more object oriented way of extending the error
> > > handling.
> >
> > Sure, but we have to assure backward compatibility as well.
> >
> > > > the existing codecs could be extended to
> > > > provide this functionality as well. We'd only need a way to
> > > > pass the callback to the codecs in some way, e.g. by using
> > > > a keyword argument on the constructor or by subclassing it
> > > > and providing a new method for the error handling in question.
> > >
> > > There is no need for a string argument 'callback' and
> > > an additional callback function/method that is passed to the
> > > encoder. When the error argument is a string, the old mechanism
> > > can be used; when it is a callable object, the new one will be used.
> >
> > This is bad style and also gives problems in the core
> > implementation (have a look at unicodeobject.c).
> 
> I did, what is the problem with changing "const char *error" to
> "PyObject *error"?

Backward compatibility. We can't change C API signatures
after they have been officially published. The Python way to
apply this kind of change would be to add new extended APIs.
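At the Python level, such an extended entry point could look like the following sketch (the function name and the dispatch strategy are hypothetical): the old string values pass through untouched, so every existing caller keeps working, while a callable is routed to the new mechanism.

```python
import codecs

def encode_extended(text, encoding, errors="strict"):
    # Hypothetical extended API: old-style string arguments
    # ("strict", "ignore", "replace") are forwarded unchanged;
    # a callable is registered as a new-style error handler.
    if callable(errors):
        name = "_extended_handler"
        codecs.register_error(name, errors)
        errors = name
    return text.encode(encoding, errors)

# Old-style call, unchanged behavior:
print(encode_extended("caf\xe9", "ascii", "replace"))
# b'caf?'

# New-style call with a callable handler:
print(encode_extended("caf\xe9", "ascii", lambda exc: ("?", exc.end)))
# b'caf?'
```

The same pattern would apply at the C level: keep PyUnicode_Encode as published and add a new function alongside it that accepts a PyObject* handler.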
 
-- 
Marc-Andre Lemburg
______________________________________________________________________
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/