[I18n-sig] error handling in charmap-based codecs

Tamito KAJIYAMA kajiyama@grad.sccs.chukyo-u.ac.jp
Wed, 20 Dec 2000 20:05:49 +0900


Hi,

Most standard codecs based on the charmap codec, such as
iso8859_2 and koi8_r, do not appear to handle errors correctly.
Although the default error handling scheme is "strict",
characters that are not in a mapping are passed through without
decoding/encoding.  Worse, any error handling scheme that is
specified is completely ignored.

The following code excerpt from Objects/unicodeobject.c shows
the problem:

    /* Get mapping (char ordinal -> integer, Unicode char or None) */
    w = PyInt_FromLong((long)ch);
    if (w == NULL)
        goto onError;
    x = PyObject_GetItem(mapping, w);
    Py_DECREF(w);
    if (x == NULL) {
        if (PyErr_ExceptionMatches(PyExc_LookupError)) {
            /* No mapping found: default to Latin-1 mapping */
            PyErr_Clear();
            *p++ = (Py_UNICODE)ch;
            continue;
        }
        goto onError;
    }

Evidently, a character that is not in the 'mapping' object is
passed through unchanged.  I'm not sure why the if statement
shown above was put there.
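
To make the fallback explicit, here is a rough Python model of
the lookup step in the excerpt.  It is only a sketch: the name
charmap_decode_sketch is made up for illustration, it is written
for Python 3 (my assumption, not what the excerpt targets), and
the handling of a None result is a simplified guess at what the
rest of the C function does.

```python
def charmap_decode_sketch(data, mapping, errors="strict"):
    """Hypothetical model of the charmap lookup shown in the C excerpt."""
    out = []
    for ch in data:  # iterating over bytes yields integer ordinals
        try:
            x = mapping[ch]
        except LookupError:
            # No mapping found: default to Latin-1 mapping.
            # Note that 'errors' is never consulted on this path.
            out.append(chr(ch))
            continue
        if x is None:
            # Explicitly undefined: only here does the scheme apply
            # (simplified assumption about the rest of the function).
            if errors == "strict":
                raise UnicodeError("character maps to <undefined>")
            if errors == "replace":
                out.append("\ufffd")
            # errors == "ignore": drop the character
            continue
        # A mapping value may be an integer ordinal or a character.
        out.append(chr(x) if isinstance(x, int) else x)
    return "".join(out)
```

With a mapping that lacks a key for 0xE9, the byte comes back as
U+00E9 whatever scheme is requested; with the key present and
mapped to None, the scheme is honoured.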

An error handling scheme works as expected if the mapping object
returns None for an undefined key.  So, I've added the following
code to my own charmap-based codecs:

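    import UserDict

    class Mapping(UserDict.UserDict):
        # Return None for an undefined key instead of raising
        # KeyError, so that the codec applies the error handling
        # scheme rather than passing the character through.
        def __getitem__(self, key):
            # Look in the underlying dict directly.
            return self.data.get(key)

    decoding_map = Mapping({
        ...
    })

    # Build the encoding map by inverting the decoding map.
    encoding_map = Mapping({})
    for k, v in decoding_map.items():
        encoding_map[v] = k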
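
As a quick check of the idea, the following sketch feeds such a
mapping to codecs.charmap_decode.  It assumes present-day
Python 3 (collections.UserDict instead of the UserDict module);
the class name NoneOnMissing, the helper decode and the
two-entry table are made up for illustration and are not taken
from any real codec.

```python
import codecs
from collections import UserDict


class NoneOnMissing(UserDict):
    """Mapping that answers None for undefined keys (hypothetical name)."""

    def __getitem__(self, key):
        # Use the underlying dict so a missing key yields None
        # rather than raising KeyError.
        return self.data.get(key)


# Toy decoding table for illustration only: two defined bytes.
decoding_map = NoneOnMissing({0x41: "A", 0x42: "B"})


def decode(data, errors="strict"):
    # charmap_decode returns (text, length consumed); keep the text.
    return codecs.charmap_decode(data, errors, decoding_map)[0]


if __name__ == "__main__":
    print(repr(decode(b"A\x80B", "replace")))
    print(repr(decode(b"A\x80B", "ignore")))
```

Under these assumptions the undefined byte 0x80 is replaced or
dropped according to the scheme, and "strict" raises
UnicodeDecodeError.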

Either Objects/unicodeobject.c or the charmap-based codecs need
a fix, I think.

Regards,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>