Issues with `codecs.register` and `codecs.CodecInfo` objects

Fri Jul 6 12:55:31 EDT 2012

Hello all,

While attempting to make a wrapper for opening multiple types of
UTF-encoded files (more on that later, in a separate post, I guess), I
ran into some oddities with the `codecs` module, specifically to do
with `register`ing `CodecInfo` objects. I'd like to report a bug or
something, but there are several intertwined issues here and I'm not
really sure how to report them, so I thought I'd open the discussion.
Apologies in advance if I get a bit rant-y, and a warning that this is
fairly long.

Observe what happens when you `register` the wrong function:

    >>> import codecs
    >>> def ham(name):
    ...     # Very obviously wrong, just for demonstration purposes
    ...     if name == 'spam': return 'eggs'
    ...
    >>> codecs.register(ham)

Already there is a problem in that there is no error... there is no
realistic way to catch this, of course, but IMHO it points to an issue
with the interface. I don't want to register a codec lookup function;
I want to register *a codec*. The built-in lookup process would be
just fine if I could just somehow tell it about this one new codec I
have... I really don't see the use case for the added flexibility of
the current interface, and it means that every time I have a new
codec, I need to either create a new lookup function as well (to
register it), or hook into an existing one that's still of my own
creation.
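
To make that concrete, here is a rough sketch of the kind of interface I have in mind: one search function backed by a dict, plus a helper that adds a single codec to it. The names `register_codec`, `_registry` and `_search` are made up for illustration and are not part of the `codecs` module. It also assumes codec names contain no spaces or hyphens, since the registry normalizes names before handing them to search functions.

```python
import codecs

# Hypothetical names throughout; none of this is provided by the codecs module.
_registry = {}

def _search(name):
    # The codec registry passes in an already-normalized (lower-case) name.
    # Returning None means "not mine", so the next search function is tried.
    return _registry.get(name)

codecs.register(_search)

def register_codec(info):
    """Make one CodecInfo findable through the normal lookup machinery."""
    # Assumes info.name has no spaces or hyphens (see the note above).
    _registry[info.name.lower()] = info

# Usage: expose UTF-8 under a second name, just to show the plumbing.
utf8 = codecs.lookup('utf-8')
register_codec(codecs.CodecInfo(utf8.encode, utf8.decode, name='spam'))
```

With something like this, adding another codec is one call to `register_codec` rather than another hand-written lookup function.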

Anyway, moving on, let's see what happens when we try to use the faulty codec:

    >>> codecs.getencoder('spam')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python32\lib\codecs.py", line 939, in getencoder
        return lookup(encoding).encode
    TypeError: codec search functions must return 4-tuples

Ehh?! That's odd. I thought I was supposed to return a `CodecInfo`
object, not a 4-tuple! Although as an aside, AFAICT the documentation
*doesn't actually document the CodecInfo class*, it just says what
attributes CodecInfo objects are supposed to have.

A bit of digging around with Google and existing old bugs on the
tracker suggests that this comes about due to backwards-compatibility:
in 2.4 and below, they *were* 4-tuples. But now CodecInfo objects are
expected to provide 6 functions (and a name), not 4. Clearly that
won't fit in a 4-tuple, and anyway I thought we had gotten rid of all
this deprecated stuff.

Regardless, let's see what happens if we do try to register a 4-tuple-lookup-er:

    >>> def spam(name):
    ...     # As long as we return a 4-tuple, it doesn't really matter
    ...     # what the functions are; errors shouldn't happen until we
    ...     # actually attempt to encode/decode. Right?
    ...     if name == 'spam': return (spam, spam, spam, spam)
    ...
    >>> codecs.register(spam)

Oops, we need to restart the interpreter, or otherwise reset global
state somehow, because the old lookup function has priority over this
one, and *there is no way to unregister it*. But once that's fixed:

    >>> codecs.getencoder('spam')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python32\lib\codecs.py", line 939, in getencoder
        return lookup(encoding).encode
    AttributeError: 'tuple' object has no attribute 'encode'

That's quite odd indeed. We can't actually trust the error message we
got before! 4-tuples don't work any more like they used to, so our
backwards-compatibility concession doesn't even work. Meanwhile, we're
left wondering how CodecInfo objects work at all. Is the error message
wrong?

Nope, well, not really. Let's grab a known-good CodecInfo object and
see what we can find out...

    >>> utf8 = codecs.lookup('utf-8')
    >>> utf8.__class__.__bases__
    (<class 'tuple'>,)
    >>> # not collections.namedtuple, which is understandable, since
    >>> # that wasn't available until 2.6...
    >>> len(utf8)
    4
    >>> # OK, apparently it magically actually is a tuple of length 4
    >>> # despite needing 7 attributes. I wonder which ones are included:
    >>> tuple(utf8)
    (<built-in function utf_8_encode>,
     <function decode at 0x01993390>,
     <class 'encodings.utf_8.StreamReader'>,
     <class 'encodings.utf_8.StreamWriter'>)
    >>> # Unsurprising: the ones mandated by the original PEP (100!
    >>> # That long ago...)

... and if we try `help` (or look at examples in the standard library
or find them with Google - but I sure don't see any in the webpage
docs), we can at least find out how to construct a CodecInfo object
properly - although, curiously, it's implemented using `__new__`
rather than `__init__`.
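
For reference, here is a minimal sketch of what a correctly constructed `CodecInfo` looks like, going by what `help(codecs.CodecInfo)` shows: `encode` and `decode` are positional, and the rest (including `name`) are optional. The toy `eggs` codec and its function names are my own invention for the example; it is just ASCII under another name.

```python
import codecs

def eggs_encode(text, errors='strict'):
    # Stateless encoder: returns (encoded bytes, number of characters consumed).
    return text.encode('ascii', errors), len(text)

def eggs_decode(data, errors='strict'):
    # Stateless decoder: returns (decoded str, number of bytes consumed).
    # The input may arrive as a memoryview, so convert it to bytes first.
    data = bytes(data)
    return data.decode('ascii', errors), len(data)

def eggs_search(name):
    # Answer only for our own codec; None lets other search functions try.
    if name == 'eggs':
        return codecs.CodecInfo(eggs_encode, eggs_decode, name='eggs')
    return None

codecs.register(eggs_search)

info = codecs.lookup('eggs')
```

The stream and incremental classes were left out here, so those attributes on `info` come back as `None`, while `info` itself is still a tuple of length 4.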

You *can* hack around with `collections.namedtuple` and create
something that basically works:

    >>> # restarting again...
    >>> import codecs, collections
    >>> my_codecinfo = collections.namedtuple(
    ...     'my_codecinfo', 'encode decode streamreader streamwriter')
    >>> def spam(name):
    ...     if name == 'spam': return my_codecinfo(spam, spam, spam, spam)
    ...
    >>> codecs.register(spam)

And now the error correctly doesn't occur until we actually attempt to
encode or decode something. Except we still don't have an incremental
decoder/encoder, and in fact those are missing attributes rather than
`None`, which is what the `CodecInfo` class defaults them to. (Of course,
we can subclass the class that `collections.namedtuple` returns to fix
this, but then we're basically reverse-engineering the
`codecs.CodecInfo` class wholesale...)
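
Just to show how far down that road you end up, here is a rough sketch of such a subclass. `MyCodecInfo`, `_Fields` and the toy `hamtuple` codec are hypothetical names of mine, and this is not how `codecs.CodecInfo` is actually implemented; it only imitates the "4-tuple plus extra attributes defaulting to `None`" behaviour.

```python
import codecs
import collections

_Fields = collections.namedtuple(
    '_Fields', 'encode decode streamreader streamwriter')

class MyCodecInfo(_Fields):
    # Hypothetical stand-in for codecs.CodecInfo: still a 4-tuple, but the
    # extra attributes exist as class attributes and default to None.
    incrementalencoder = None
    incrementaldecoder = None
    name = None

def ham_encode(text, errors='strict'):
    # Stateless encoder: (encoded bytes, number of characters consumed).
    return text.encode('ascii', errors), len(text)

def ham_decode(data, errors='strict'):
    # Stateless decoder: (decoded str, number of bytes consumed).
    data = bytes(data)
    return data.decode('ascii', errors), len(data)

def ham_search(name):
    if name == 'hamtuple':
        return MyCodecInfo(ham_encode, ham_decode, None, None)
    return None

codecs.register(ham_search)
```

With the defaults in place, `codecs.getincrementaldecoder('hamtuple')` fails with `LookupError` instead of `AttributeError`, which is the same behaviour a real `CodecInfo` gives.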

Speaking of which, one last thing:

    >>> # Another restart, of course
    >>> import codecs
    >>> def spam(name):
    ...     if name == 'spam': return codecs.CodecInfo(spam, spam)
    ...
    >>> codecs.register(spam)
    >>> codecs.getincrementaldecoder('spam')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python32\lib\codecs.py", line 976, in getincrementaldecoder
        raise LookupError(encoding)
    LookupError: spam

That seems wrong to me too: the codec is certainly *there*, it just
doesn't support incremental decoding. I would expect the error message
to be more specific.
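
Something along these lines is what I would expect, sketched as a wrapper rather than a patch. `get_incremental_decoder` is a hypothetical helper of mine, not an existing `codecs` function, and the wording of the message is only a suggestion.

```python
import codecs

def get_incremental_decoder(encoding):
    # Hypothetical wrapper, not part of the codecs module.
    # codecs.lookup() itself raises LookupError for a genuinely unknown name,
    # so that case keeps its own error.
    info = codecs.lookup(encoding)
    decoder = getattr(info, 'incrementaldecoder', None)
    if decoder is None:
        # The codec was found; it just has no incremental decoder.
        raise LookupError(
            "codec %r exists but does not support incremental decoding"
            % encoding)
    return decoder

# Usage: a codec that does support it behaves exactly as before.
utf8_decoder_class = get_incremental_decoder('utf-8')
```

That way "no such codec" and "codec without incremental support" are at least distinguishable from the message alone.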

--
~Zahlman {:>