[I18n-sig] error handling in charmap-based codecs
Tamito KAJIYAMA
kajiyama@grad.sccs.chukyo-u.ac.jp
Wed, 20 Dec 2000 20:05:49 +0900
Hi,
Most standard codecs based on the charmap codec, such as
iso8859_2 and koi8_r, do not appear to handle errors correctly.
Although the default error handling scheme is "strict",
characters that are not in a mapping are passed through without
being decoded/encoded. Worse, any error handling scheme that is
specified is completely ignored.
The following code excerpt from Objects/unicodeobject.c (lines
1965-1979) shows the problem:
    /* Get mapping (char ordinal -> integer, Unicode char or None) */
    w = PyInt_FromLong((long)ch);
    if (w == NULL)
        goto onError;
    x = PyObject_GetItem(mapping, w);
    Py_DECREF(w);
    if (x == NULL) {
        if (PyErr_ExceptionMatches(PyExc_LookupError)) {
            /* No mapping found: default to Latin-1 mapping */
            PyErr_Clear();
            *p++ = (Py_UNICODE)ch;
            continue;
        }
        goto onError;
    }
Evidently, a character not in the 'mapping' object is passed
through as is. I'm not sure why the if statement shown above was
put there.
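
To make the control flow easier to see, here is a rough pure-Python
model of the lookup quoted above. It is only a sketch for a Python 3
interpreter, not the actual implementation: the function name
charmap_decode_model is made up, and the handling of "replace" and
"ignore" in the None branch is an assumption about what "works as
expected" means.

```python
def charmap_decode_model(data, mapping, errors="strict"):
    """Hypothetical pure-Python model of the quoted C lookup (not the real code)."""
    out = []
    for pos, ch in enumerate(data):  # iterating over bytes yields int ordinals
        try:
            x = mapping[ch]
        except LookupError:
            # No mapping found: fall back to Latin-1. `errors` is never consulted.
            out.append(chr(ch))
            continue
        if x is None:
            # Explicitly undefined: the only path that looks at `errors`.
            if errors == "strict":
                raise UnicodeDecodeError(
                    "charmap", bytes(data), pos, pos + 1,
                    "character maps to <undefined>")
            if errors == "replace":
                out.append("\ufffd")  # assumed replacement character
            elif errors != "ignore":
                raise ValueError("unknown error handling scheme: %r" % errors)
            continue
        # The mapping may hold either an integer ordinal or a Unicode string.
        out.append(chr(x) if isinstance(x, int) else x)
    return "".join(out)


# Illustrative one-entry mapping; 0xFF is deliberately left out.
partial_map = {0xA1: 0x0104}
passed_through = charmap_decode_model(b"\xa1\xff", partial_map)
```

In this model the byte 0xFF comes back as U+00FF even under "strict",
because the missing key never reaches the error handling branch. Only
a mapping that stores None for 0xFF would raise.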
An error handling scheme works as expected if the mapping object
returns None for an undefined key. So I've added the following
code to my own charmap-based codecs:
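    import UserDict

    class Mapping(UserDict.UserDict):
        # Return None instead of raising KeyError for an undefined key.
        # Look in self.data directly so __getitem__ never calls itself.
        def __getitem__(self, key):
            return self.data.get(key)

    decoding_map = Mapping({
        ...
    })

    # Build the reverse mapping used for encoding.
    encoding_map = Mapping({})
    for k, v in decoding_map.items():
        encoding_map[v] = k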
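
For comparison, the same idea can be sketched for a Python 3
interpreter using codecs.charmap_decode and codecs.charmap_encode.
The class name NoneDefaultMapping, the decode/encode helpers, and the
two-entry table are made up for illustration. As far as I know,
current Python 3 already treats a missing key as undefined, so this
shows the technique rather than a workaround that is still needed.

```python
import codecs


class NoneDefaultMapping(dict):
    """dict that yields None for undefined keys (hypothetical helper)."""

    def __missing__(self, key):
        # Called by dict lookup when the key is absent.
        return None


# Illustrative two-entry table: byte value -> Unicode ordinal.
decoding_map = NoneDefaultMapping({0x41: 0x0104, 0x42: 0x0105})

# Reverse table for encoding: Unicode ordinal -> byte value.
encoding_map = NoneDefaultMapping((v, k) for k, v in decoding_map.items())


def decode(data, errors="strict"):
    # charmap_decode returns (text, bytes_consumed); keep only the text.
    return codecs.charmap_decode(data, errors, decoding_map)[0]


def encode(text, errors="strict"):
    # charmap_encode returns (bytes, chars_consumed); keep only the bytes.
    return codecs.charmap_encode(text, errors, encoding_map)[0]
```

With these tables, decode(b"A\xff") should raise UnicodeDecodeError
under "strict", while decode(b"A\xff", "replace") substitutes U+FFFD
for the undefined byte.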
Either Objects/unicodeobject.c or the charmap-based codecs need
a fix, I think.
Regards,
--
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>