[I18n-sig] error handling in charmap-based codecs

M.-A. Lemburg mal@lemburg.com
Wed, 20 Dec 2000 20:06:23 +0100


"Martin v. Loewis" wrote:
> 
> > Most standard codecs based on the charmap codec, such as
> > iso8859_2 and koi8_r, appear not to do correct error handling.
> > Although the default error handling scheme is "strict",
> > characters that are not in a mapping are passed through without
> > decoding/encoding.  Worse, any error handling scheme that is
> > specified is completely ignored.

This is because I wanted to avoid having to put a huge number of 
mappings to None into the codec dictionaries. This would have
caused the codec modules and dictionaries to become much larger
than acceptable for the standard distribution. The charmap codec
was originally written to simplify writing codecs for 8-bit 
encodings. Most of these only alter a few characters, which would
not warrant including mappings for all 256 characters in both directions.
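
To make the size argument concrete, here is a minimal sketch in Python.
The two overrides are taken from ISO 8859-15 purely for illustration,
and lookup() is a hypothetical helper that models the fall-through
described above; it is not part of the codec machinery.

```python
# Sparse table: list only the bytes that differ from Latin-1.
# (Illustrative subset of ISO 8859-15: 0xA4 -> EURO SIGN, 0xA6 -> S WITH CARON.)
sparse_map = {0xA4: 0x20AC, 0xA6: 0x0160}

# Full table: every one of the 256 byte values listed explicitly.
full_map = {byte: byte for byte in range(256)}
full_map.update(sparse_map)


def lookup(mapping, byte):
    """Hypothetical helper: missing keys fall through as Latin-1."""
    return mapping.get(byte, byte)


# With the fall-through rule, both tables describe the same decoding.
same_result = all(lookup(sparse_map, b) == full_map[b] for b in range(256))
```

The sparse table has 2 entries where the full one has 256, and an
encoding map would double that again for the other direction.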

> Indeed. I have filed a bug report, "Unicode encoders don't report
> errors properly",
> 
> http://sourceforge.net/bugs/?func=detailbug&bug_id=116285&group_id=5470
> 
> Unfortunately, there is disagreement whether this is a bug, or what
> the nature of the bug is.

There is ?
 
> >         /* Get mapping (char ordinal -> integer, Unicode char or None) */
> >         w = PyInt_FromLong((long)ch);
> >         if (w == NULL)
> >             goto onError;
> >         x = PyObject_GetItem(mapping, w);
> >         Py_DECREF(w);
> >         if (x == NULL) {
> >             if (PyErr_ExceptionMatches(PyExc_LookupError)) {
> >                 /* No mapping found: default to Latin-1 mapping */
> >                 PyErr_Clear();
> >                 *p++ = (Py_UNICODE)ch;
> >                 continue;
> >             }
> >             goto onError;
> >         }
> >
> > Evidently, a character not in the 'mapping' object is passed as
> > it is.  I'm not sure why the if statement shown above has been
> > put here.
> 
> I'm not sure, either. There is no documentation of what the function is
> supposed to do, so it is hard to tell whether it does so correctly.

Ok, let me document it: It does what it's supposed to do :-)
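
As a rough illustration of that behaviour, the decoding loop quoted
above can be modeled in Python. This is only a sketch under stated
assumptions: charmap_decode_sketch() is a hypothetical name, the code
is written in present-day Python syntax for readability, and the
handling of "replace" and "ignore" is simplified.

```python
def charmap_decode_sketch(data, mapping, errors="strict"):
    """Model of the quoted loop; not the real C implementation."""
    out = []
    for pos, byte in enumerate(data):
        try:
            x = mapping[byte]
        except LookupError:
            # No mapping found: default to Latin-1 (byte value == code point).
            # The error handling scheme is never consulted on this path.
            out.append(chr(byte))
            continue
        if x is None:
            # Explicitly undefined: only here does the error scheme apply.
            if errors == "strict":
                raise UnicodeError(
                    "byte 0x%02x at position %d maps to <undefined>" % (byte, pos)
                )
            if errors == "replace":
                out.append("\ufffd")
            elif errors != "ignore":
                raise ValueError("unknown error handling scheme: %r" % errors)
            continue
        # The mapping may hold an integer ordinal or a Unicode string.
        out.append(chr(x) if isinstance(x, int) else x)
    return "".join(out)


# Hypothetical mapping: 0x80 is remapped, 0x81 is declared undefined.
demo_map = {0x80: 0x20AC, 0x81: None}
passed_through = charmap_decode_sketch(b"\xe9", demo_map)       # key missing
replaced = charmap_decode_sketch(b"a\x81b", demo_map, "replace")  # key -> None
```

Here the byte 0xE9 has no entry, so it is passed through unchanged even
under "strict", while 0x81 maps to None and is routed to the error
handler.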

> IMO, it should read
> 
>        if (x == NULL) {
>            if (PyErr_ExceptionMatches(PyExc_LookupError)) {
>                /* No mapping found: treat the character as undefined */
>                PyErr_Clear();
>                x = Py_None;
>                Py_INCREF(x);
>            } else
>                goto onError;
>        }
> 
> I can't see any reason for defaulting to *Latin-1*.

See above. The encodings using the charmap codec are usually
only minor modifications of Latin-1.
 
> > An error handling scheme works as expected if the mapping object
> > returns None for an undefined key.  So, I've added the following
> > code to charmap-based codecs of mine:
> 
> Yes, that is also the proposed solution in response to my bug
> report. I don't like it at all as a solution; it's an ok work-around.
> As a solution, it is stupid: All codecs will have to pay the cost for
> UserDict accesses, and no codec makes use of this 1:1 "feature" -
> when the real solution is a three-line change.

Huh ? The solution is simple: you only have to add mappings to None
as appropriate. There's no need to change the codec.
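
A minimal sketch of what such a mapping looks like in practice, using
codecs.charmap_decode() from the standard library. The table contents
and the decode() wrapper are hypothetical and chosen only for
illustration; every byte used in the example has an explicit entry.

```python
import codecs

# Hypothetical decoding map: 0x81 is declared undefined by mapping it to None.
decoding_map = {0x41: "A", 0x42: "B", 0x81: None}


def decode(data, errors="strict"):
    """Decode bytes with the hypothetical table above."""
    text, consumed = codecs.charmap_decode(data, errors, decoding_map)
    return text


replaced = decode(b"A\x81B", "replace")   # undefined byte becomes U+FFFD
ignored = decode(b"A\x81B", "ignore")     # undefined byte is dropped

try:
    decode(b"A\x81B", "strict")
    strict_failed = False
except UnicodeDecodeError:
    # "strict" reports the byte that maps to None as an error.
    strict_failed = True
```

With the None entry in place, the requested error handling scheme
decides what happens to the undefined byte.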

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/