[Python-Dev] Codecs and StreamCodecs

Fred L. Drake, Jr. fdrake@acm.org
Thu, 18 Nov 1999 11:01:47 -0500 (EST)


M.-A. Lemburg writes:
 > The problem is that the encoding names are not Python identifiers,
 > e.g. iso-8859-1 is allowed as identifier. This and
 > the fact that applications may want to ship their own codecs (which
 > do not get installed under the system wide encodings package)
 > make the registry necessary.

  This isn't a substantial problem.  Try this on for size (probably
not too different from what everyone is already thinking, but let's
make it clear).  This could be in encodings/__init__.py; I've tried to 
be really clear on the names.  (No testing, only partially complete.)

------------------------------------------------------------------------
import string
import sys

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO


class EncodingError(Exception):
    def __init__(self, encoding, error):
        self.encoding = encoding
        self.strerror = "%s %s" % (error, `encoding`)
        self.error = error
        Exception.__init__(self, encoding, error)


_registry = {}

def registerEncoding(encoding, encode=None, decode=None,
                     make_stream_encoder=None, make_stream_decoder=None):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        info = _registry[encoding]
    else:
        info = _registry[encoding] = Codec(encoding)
    info._update(encode, decode,
                 make_stream_encoder, make_stream_decoder)


def getCodec(encoding):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        return _registry[encoding]

    # load the module
    modname = "encodings." + encoding.replace("-", "_")
    try:
        __import__(modname)
    except ImportError:
        raise EncodingError(encoding, "unknown encoding")

    # if the module registered, use the codec as-is:
    if _registry.has_key(encoding):
        return _registry[encoding]

    # nothing registered, use well-known names
    module = sys.modules[modname]
    codec = _registry[encoding] = Codec(encoding)
    encode = getattr(module, "encode", None)
    decode = getattr(module, "decode", None)
    make_stream_encoder = getattr(module, "make_stream_encoder", None)
    make_stream_decoder = getattr(module, "make_stream_decoder", None)
    codec._update(encode, decode,
                  make_stream_encoder, make_stream_decoder)
    return codec


class Codec:
    __encode = None
    __decode = None
    __stream_encoder_factory = None
    __stream_decoder_factory = None

    def __init__(self, name):
        self.name = name

    def encode(self, u):
        # prefer the one-shot function; fall back to the stream encoder
        if self.__encode:
            return self.__encode(u)
        elif self.__stream_encoder_factory:
            sio = StringIO()
            encoder = self.__stream_encoder_factory(sio)
            encoder.write(u)
            encoder.flush()
            return sio.getvalue()
        else:
            raise EncodingError(self.name, "no encoder available for")

    # similar for decode()...

    def make_stream_encoder(self, target):
        if self.__stream_encoder_factory:
            return self.__stream_encoder_factory(target)
        elif self.__encode:
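            # DefaultStreamEncoder is not defined in this listing; it would
            # wrap the one-shot encode function in a stream interface.
            return DefaultStreamEncoder(target, self.__encode)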
        else:
            raise EncodingError(self.name, "no encoder available for")

    # similar for make_stream_decoder()...

    def _update(self, encode, decode,
                make_stream_encoder, make_stream_decoder):
        self.__encode = encode or self.__encode
        self.__decode = decode or self.__decode
        self.__stream_encoder_factory = (
            make_stream_encoder or self.__stream_encoder_factory)
        self.__stream_decoder_factory = (
            make_stream_decoder or self.__stream_decoder_factory)
------------------------------------------------------------------------
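
  The listing uses DefaultStreamEncoder without defining it.  A minimal
sketch of what such a wrapper might look like follows.  It is written in
present-day Python 3 rather than the dialect of the listing, and both the
class body and the latin1_encode() stand-in are assumptions made for
illustration, not part of the proposal.  It is only suitable for stateless
encodings, since each chunk is encoded independently.

```python
import io


class DefaultStreamEncoder:
    """Hypothetical fallback: adapt a one-shot encode() function to a stream."""

    def __init__(self, target, encode):
        self.target = target    # any object with a write() method
        self.encode = encode    # callable: text -> encoded bytes

    def write(self, u):
        # Encode each chunk on its own and pass the result straight through.
        self.target.write(self.encode(u))

    def flush(self):
        # Nothing is buffered here; forward to the target if it can flush.
        flush = getattr(self.target, "flush", None)
        if flush is not None:
            flush()


def latin1_encode(u):
    # Stand-in for a codec module's encode() function (an assumption).
    return u.encode("latin-1")


sio = io.BytesIO()
encoder = DefaultStreamEncoder(sio, latin1_encode)
encoder.write("caf\u00e9")
encoder.flush()
result = sio.getvalue()
```

  An encoding that carries state between chunks would need a real stream
encoder from its module instead of this fallback.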

 > I don't see a problem with the registry though -- the encodings
 > package can take care of the registration process without any

  No problem at all; we just need to make sure the right magic is
there for the "normal" case.
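
  To illustrate the "normal" case: a codec module that registers nothing
and simply defines some of the well-known names.  The sketch below is
present-day Python 3, not the dialect of the listing; module_name_for()
and well_known_names() are hypothetical helper names, and the module is
simulated with types.ModuleType instead of being imported.

```python
import types


def module_name_for(encoding, package="encodings"):
    # Same normalization as getCodec(): lower-case, hyphens to underscores.
    return package + "." + encoding.lower().replace("-", "_")


def well_known_names(module):
    # Collect whichever of the conventional attributes the module defines;
    # missing ones come back as None.
    names = ("encode", "decode", "make_stream_encoder", "make_stream_decoder")
    return {name: getattr(module, name, None) for name in names}


# Simulate a codec module that only provides encode() and decode().
mod = types.ModuleType(module_name_for("ISO-8859-1"))
mod.encode = lambda u: u.encode("latin-1")
mod.decode = lambda s: s.decode("latin-1")

found = well_known_names(mod)
```

  Here found["make_stream_encoder"] is None, which is the situation where
the codec object would fall back to wrapping the plain encode function.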

 > PS: we could probably even take the whole codec idea one step
 > further and also allow other input/output formats to be registered,

  File formats are different from text encodings, so let's keep them
separate.  Yes, a registry can be a good approach whenever the various 
things being registered are sufficiently similar semantically, but the 
behavior of the registry/lookup can be very different for each type of 
thing.  Let's not over-generalize.


  -Fred

--
Fred L. Drake, Jr.	     <fdrake@acm.org>
Corporation for National Research Initiatives