[Python-Dev] Stateful codecs [Was: str object going in Py3K]

Sat Feb 18 17:11:39 CET 2006

M.-A. Lemburg wrote:
> Walter Dörwald wrote:
>>>>> I'd suggest we keep codecs.lookup() the way it is and
>>>>> instead add new functions to the codecs module, e.g.
>>>>> codecs.getencoderobject() and codecs.getdecoderobject().
>>>>>
>>>>> Changing the codec registration is not much of a problem:
>>>>> we could simply allow 6-tuples to be passed into the
>>>>> registry.
>>>> OK, so codecs.lookup() returns 4-tuples, but the registry stores 6-tuples and the search functions must return 6-tuples.
>>>> And we add codecs.getencoderobject() and codecs.getdecoderobject() as well as new classes codecs.StatefulEncoder and
>>>> codecs.StatefulDecoder. What about old search functions that return 4-tuples?
>>>
>>> The registry should then simply set the missing entries to None and the getencoderobject()/getdecoderobject() would then
>>> have
>>> to raise an error.
>>
>> Sounds simple enough and we don't loose backwards compatibility.
>>
>>> Perhaps we should also deprecate codecs.lookup() in Py 2.5 ?!
>>
>> +1, but I'd like to have a replacement for this, i.e. a function that returns all info the registry has about an encoding:
>>
>> 1. Name
>> 2. Encoder function
>> 3. Decoder function
>> 4. Stateful encoder factory
>> 5. Stateful decoder factory
>> 6. Stream writer factory
>> 7. Stream reader factory
>>
>> and if this is an object with attributes, we won't have any problems if we extend it in the future.
>
> Shouldn't be a problem: just expose the registry dictionary
> via the _codecs module.
>
> The rest can then be done in a Python function defined in
> codecs.py using a CodecInfo class.

This would require the Python code to call codecs.lookup() and then look into the codecs dictionary (normalizing the encoding
name again). Maybe we should make a version of __PyCodec_Lookup() that allows 4- and 6-tuples available to Python and use that?
The official PyCodec_Lookup() would then have to downgrade the 6-tuples to 4-tuples.
>> BTW, if we change the API, can we fix the return value of the stateless functions? As the stateless function always
>> encodes/decodes the complete string, returning the length of the string doesn't make sense.
>> codecs.getencoder() and codecs.getdecoder() would have to continue to return the old variant of the functions, but
>> codecs.getinfo("latin-1").encoder would be the new encoding function.
>
> No: you can still write stateless encoders or decoders that do
> not process the whole input string. Just because we don't have
> any of those in Python, doesn't mean that they can't be written
> and used. A stateless codec might want to leave the work
> of buffering bytes at the end of the input data which cannot
> be processed to the caller.

But what would the call do with that info? It can't retry encoding/decoding the rejected input, because the state of the codec
has been thrown away already.
> It is also possible to write
> stateful codecs on top of such stateless encoding and decoding
> functions.

That's what the codec helper functions from Python/_codecs.c are for.

Anyway, I've started implementing a patch that just adds codecs.StatefulEncoder/codecs.StatefulDecoder. UTF8, UTF8-Sig, UTF-16,
UTF-16-LE and UTF-16-BE are already working.
Bye,
    Walter Dörwald