[I18n-sig] Proposal: Extended error handling
for unicode.encode
"Walter Dörwald"
walter@livinglogic.de
Wed, 03 Jan 2001 20:18:58 +0100
On 22.12.00 at 19:15 M.-A. Lemburg wrote:
> "Walter Dörwald" wrote:
> >
> > On 21.12.00 at 18:30 M.-A. Lemburg wrote:
> > > [about state in encoders and error handlers]
> > But I don't see how this internal encoder state should influence
> > what the error handler does. There are two layers involved: The
> > character encoding layer and the "unencodable character escape
> > mechanism". Both layers are completely independent, even in your
> > "Unicode compression" example, where the "unencodable character
> > escape mechanism" is XML character entities.
>
> This is true for your XML entity escape example, but error
> resolving in general will likely need to know about the
> current state of the encoder, e.g. to be able to write data
> to the corresponding page in the Unicode compression example, or
> to force a switch of the current page to a different one.
What does this "Unicode compression" example look like?
> I know that error handling could be more generic, but passing
> a callable object instead of the error parameter is not an
> option since the internal APIs all use a const char parameter
> for error.
Changing this could be done in one or two hours by someone
who knows the Python internals. (Unfortunately I don't; I first
looked at unicodeobject.[hc] only a few days ago!)
> Besides, I consider such an approach a hack and not
> a solution.
>
> Instead of trying to tweak the implementation into providing
> some kind of new error scheme, let's focus on finding a generic
> framework which could provide a solution for the general case
> while not breaking the existing applications.
Are the existing codecs (JapaneseCodecs etc.) to be considered part
of the existing applications?
The problem might be how to handle callbacks to C functions and
callbacks to Python functions in a consistent way. I.e. should it be
extern DL_IMPORT(PyObject*) PyUnicode_Encode(
    const Py_UNICODE *s,    /* Unicode char buffer */
    int size,               /* number of Py_UNICODE chars to encode */
    const char *encoding,   /* encoding */
    PyUnicodeObject *(*errorHandler)(PyUnicodeObject *string, int position)
                            /* error handling via C function */
);
or
extern DL_IMPORT(PyObject*) PyUnicode_Encode(
    const Py_UNICODE *s,    /* Unicode char buffer */
    int size,               /* number of Py_UNICODE chars to encode */
    const char *encoding,   /* encoding */
    PyObject *errorHandler  /* error handling via Python function */
);
> > > Writing your own function helpers which then apply all the necessary
> > > magic is simple and doesn't warrant changing APIs in the core.
> >
> > It is not as simple as the error handler, but I could live with that.
> >
> > The big problem is that it effectively kills the speed of your
> > application. Every XML application written in Python, no matter
> > what it does internally, will in the end have to produce an output
> > bytestring. Normally the output encoding should be one that produces
> > no unencodable characters, but you have to be prepared to handle
> > them. With the error handler the complete encoding will be done
> > in C code (with very infrequent calls to the error handler), so
> > this scheme gives the best speed possible.
>
> It would give even better performance if the codec would provide
> this hook in some way at C level.
extern DL_IMPORT(PyObject*) PyUnicode_Encode(
    const Py_UNICODE *s,    /* Unicode char buffer */
    int size,               /* number of Py_UNICODE chars to encode */
    const char *encoding,   /* encoding */
    PyUnicodeObject *(*errorHandler)(PyUnicodeObject *string, int position)
                            /* error handling via C function */
);
would, but that's not the point. When you use an encoding where more
than 20% of the characters have to be escaped (as XML entities or whatever),
you're using the wrong encoding.
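The speed argument above can be made concrete: with a callback-style handler, the fast path stays in C and the callback fires only for the rare unencodable characters. A small sketch using the `codecs.register_error` API that later grew out of this discussion (the handler name "counting-demo" is made up):

```python
import codecs

calls = 0

def counting_handler(exc):
    # Count invocations, then replace the offending character with "?".
    global calls
    calls += 1
    return ("?", exc.end)

codecs.register_error("counting-demo", counting_handler)

text = "mostly plain ASCII text with a single euro sign: \u20ac"
encoded = text.encode("ascii", "counting-demo")
# The bulk of the string is encoded in C; the handler fired exactly once.
```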
> Note that almost all codecs
> have their own error handlers written in C already.
>
> > > Since the error handling is extensible by adding new options
> > > such as 'callback',
> >
> > I would prefer a more object oriented way of extending the error
> > handling.
>
> Sure, but we have to assure backward compatibility as well.
>
> > > the existing codecs could be extended to
> > > provide this functionality as well. We'd only need a way to
> > > pass the callback to the codecs in some way, e.g. by using
> > > a keyword argument on the constructor or by subclassing it
> > > and providing a new method for the error handling in question.
> >
> > There is no need for a string argument 'callback' and
> > an additional callback function/method that is passed to the
> > encoder. When the error argument is a string, the old mechanism
> > can be used, when it is a callable object the new will be used.
>
> This is bad style and also gives problems in the core
> implementation (have a look at unicodeobject.c).
I did. What is the problem with changing "const char *error" to
"PyObject *error"?
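The string-or-callable dispatch discussed above can be sketched at the Python level with today's API: a string selects the classic error scheme unchanged, while a callable is routed through the registry. The wrapper name `encode` and the generated handler names are illustrative only; a real implementation would cache registrations instead of creating one per call:

```python
import codecs
import itertools

_handler_ids = itertools.count()

def encode(s, encoding, errors="strict"):
    # Hybrid dispatch: a string keeps the old mechanism, a callable
    # is registered on the fly under a generated name.  (Sketch only;
    # registrations are never released here.)
    if callable(errors):
        name = "hybrid-handler-%d" % next(_handler_ids)
        codecs.register_error(name, errors)
        errors = name
    return s.encode(encoding, errors)
```

For example, `encode("a\u20ac", "ascii", lambda exc: ("?", exc.end))` returns `b"a?"`, while `encode("abc", "ascii")` behaves exactly like `"abc".encode("ascii")`.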
Bye,
Walter Dörwald
--
Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de