[Python-Dev] Codecs and StreamCodecs

M.-A. Lemburg mal@lemburg.com
Tue, 16 Nov 1999 17:00:58 +0100


Here is a new proposal for the codec interface:

class Codec:

    def encode(self,u,slice=None):
	
	""" Return the Unicode object u encoded as a Python string.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is encoded.

	    The method must not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	"""
	...

    def decode(self,s,slice=None):

	""" Return an equivalent Unicode object for the encoded Python
	    string s.

	    If slice is given (as slice object), only the sliced part
	    of the Python string is decoded and returned as a Unicode
	    object.  Note that this can cause the decoding algorithm
	    to fail due to truncations in the encoding.

	    The method must not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	""" 
	...
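
To make the slice semantics concrete, here is a minimal sketch of a stateless codec. It is written for present-day Python 3, where str plays the role of the Unicode object and bytes the role of the Python string. The class name Utf8Codec and the choice of UTF-8 are assumptions made for illustration; neither is part of the proposal.

```python
class Utf8Codec:
    """Hypothetical stateless codec following the proposed interface."""

    def encode(self, u, slice=None):
        # Encode only the selected part of the text when a slice is given.
        if slice is not None:
            u = u[slice]
        return u.encode("utf-8")

    def decode(self, s, slice=None):
        # The slice applies to the encoded bytes, so it can cut a
        # multi-byte sequence in half.
        if slice is not None:
            s = s[slice]
        return s.decode("utf-8")


codec = Utf8Codec()
text = "na\u00efve"                       # 5 characters
data = codec.encode(text)                 # 6 bytes: "\u00ef" needs two
part = codec.encode(text, slice(0, 2))    # only "na" is encoded

try:
    # Bytes 0..2 end in the middle of the two-byte sequence.
    codec.decode(data, slice(0, 3))
    truncated_failed = False
except UnicodeDecodeError:
    truncated_failed = True
```

The last call illustrates the truncation failure the decode() docstring warns about: slicing happens on the encoded side, where character boundaries are not visible.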
	

class StreamCodec(Codec):

    def __init__(self,stream=None,errors='strict'):

	""" Creates a StreamCodec instance.

	    stream must be a file-like object open for reading and/or
	    writing binary data depending on the intended codec
            action or None.

	    The StreamCodec may implement different error handling
	    schemes, selected via the errors argument. These values
	    are defined (they need not all be supported by StreamCodec
	    subclasses):

	     'strict' - raise a UnicodeError (or a subclass)
	     'ignore' - ignore the character and continue with the next
	     (a single character)
	              - replace erroneous characters with the given
	                character (may also be a Unicode character)

	"""
	self.stream = stream
	self.errors = errors

    def write(self,u,slice=None):

	""" Writes the Unicode object's contents encoded to self.stream.

	    self.stream must be a file-like object open for writing
	    binary data.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is written.

        """
	# The base class should provide a default implementation
	# of this method using self.encode.
	...
	
    def read(self,length=None):

	""" Reads an encoded string from the stream and returns
	    an equivalent Unicode object.

	    If length is given, only length Unicode characters are
	    returned (the StreamCodec instance reads as many raw bytes
            as needed to fulfill this requirement). Otherwise, all
	    available data is read and decoded.

        """
	# The base class should provide a default implementation
	# of this method using self.decode.
	...
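
As a rough idea of what such default implementations could look like, here is a sketch in present-day Python 3 (str as the Unicode object, bytes as the encoded string). The class name Utf8StreamCodec, the use of UTF-8, and the byte-at-a-time loop in read() are all assumptions for illustration. The sketch stores errors but only handles 'strict'.

```python
import io


class Utf8StreamCodec:
    """Hypothetical StreamCodec sketch; UTF-8 is used only as an example."""

    def __init__(self, stream=None, errors="strict"):
        self.stream = stream
        self.errors = errors  # stored, but this sketch only handles 'strict'

    def encode(self, u, slice=None):
        if slice is not None:
            u = u[slice]
        return u.encode("utf-8")

    def decode(self, s, slice=None):
        if slice is not None:
            s = s[slice]
        return s.decode("utf-8")

    def write(self, u, slice=None):
        # Default implementation: encode, then hand the bytes to the stream.
        self.stream.write(self.encode(u, slice))

    def read(self, length=None):
        if length is None:
            # No limit: decode everything that is available.
            return self.decode(self.stream.read())
        buf = bytearray()
        text = ""
        # Pull one byte at a time until `length` characters decode cleanly.
        while len(text) < length:
            byte = self.stream.read(1)
            if not byte:
                # End of stream: decode what is left. This raises if the
                # data ends inside a character.
                return self.decode(bytes(buf))
            buf += byte
            try:
                text = self.decode(bytes(buf))
            except UnicodeDecodeError:
                # Most likely an incomplete multi-byte sequence; read on.
                continue
        return text


buffer = io.BytesIO()
Utf8StreamCodec(buffer).write("gr\u00fc\u00dfe")

buffer.seek(0)
reader = Utf8StreamCodec(buffer)
first = reader.read(3)   # three characters, four raw bytes
rest = reader.read()     # everything that remains

sliced = io.BytesIO()
Utf8StreamCodec(sliced).write("gr\u00fc\u00dfe", slice(0, 3))
```

Re-decoding the whole buffer after every byte is quadratic and only acceptable in a sketch; a real codec would keep decoder state between reads, which is exactly the kind of state the stateless Codec methods are not allowed to hold.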


The unicodec.register() API does not require a subclass of these
base classes; only the given methods must be present. This allows
writing Codecs as extension types.  All Codecs must provide the
.encode()/.decode() methods. Codecs having the .read() and/or
.write() methods are considered to be StreamCodecs.
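
The method-based rule above could be checked along these lines. This is only a sketch: classify_codec and the two example classes are hypothetical names, and nothing here is meant to describe how unicodec.register() itself would be implemented.

```python
def classify_codec(obj):
    """Hypothetical helper: classify an object by the methods it provides."""

    def has(name):
        return callable(getattr(obj, name, None))

    # Every codec needs the stateless pair.
    if not (has("encode") and has("decode")):
        raise TypeError("a codec must provide .encode() and .decode()")
    # Either stream method is enough to count as a StreamCodec.
    if has("read") or has("write"):
        return "stream"
    return "stateless"


class AsciiCodec:
    # Deliberately not derived from any codec base class.
    def encode(self, u, slice=None):
        return (u if slice is None else u[slice]).encode("ascii")

    def decode(self, s, slice=None):
        return (s if slice is None else s[slice]).decode("ascii")


class AsciiWriter(AsciiCodec):
    def __init__(self, stream=None, errors="strict"):
        self.stream = stream

    def write(self, u, slice=None):
        self.stream.write(self.encode(u, slice))


kind_plain = classify_codec(AsciiCodec())
kind_stream = classify_codec(AsciiWriter())
```

Checking for methods rather than base classes is what lets an extension type written in C take part without inheriting from a Python class.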

The Unicode implementation will by itself only use the
stateless .encode() and .decode() methods.

All other conversions have to be done by explicitly instantiating
the appropriate [Stream]Codec.
--

Feel free to beat on this one ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/