[Python-Dev] Decoding incomplete unicode

Thu Aug 26 22:19:48 CEST 2004

M.-A. Lemburg wrote:

> Martin v. Löwis wrote:
> 
>> M.-A. Lemburg wrote:
>>
>>> Martin, there are two reasons for hiding away these details:
>>>
>>> 1. we need to be able to change the codec state without
>>>    breaking the APIs
>>
>> That will be possible with the currently-proposed patch.
>> The _codecs methods are not public API, so changing them
>> would not be an API change.
> 
> Uhm, I wasn't talking about the builtin codecs only (of course,
> we can change those to our liking). I'm after a generic
> interface for stateful codecs.

But that interface is only between the StreamReader
and any helper function that the codec implementer
might want to use. If there is no helper function,
there is no interface.

>>> 2. we don't want the state to be altered by the user
>>
>> We are all consenting adults, and we can't *really*
>> prevent it, anyway. For example, the user may pass an
>> old state, or a state originating from a different codec
>> (instance). We need to support this gracefully (i.e. with
>> a proper Python exception).
> 
> True, but the codec writer should be in control of the
> state object, its format and what the user can or cannot
> change.

Yes, we should not dictate how, why or if the codec
implementer has to pass around any state. The only thing
we have to dictate is that StreamReaders have to keep their
state between calls to read().
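
To illustrate what "keeping state between calls to read()" means,
here is a minimal sketch. The class SketchUTF8Reader and its
attributes are made up for this example and are not part of the
patch; it assumes only that codecs.utf_8_decode() accepts a "final"
flag and reports how many bytes it consumed.

```python
import codecs


class SketchUTF8Reader:
    """Hypothetical reader that keeps undecoded bytes between read() calls."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)  # stand-in for an underlying byte stream
        self._pending = b""          # the state that must survive between calls

    def read(self):
        data = self._pending + next(self._chunks, b"")
        # With final=False an incomplete trailing sequence is not an error;
        # it is simply left unconsumed and reported via the second result.
        text, consumed = codecs.utf_8_decode(data, "strict", False)
        self._pending = data[consumed:]
        return text


# The two bytes of U+00F6 are split across two chunks.
reader = SketchUTF8Reader([b"D\xc3", b"\xb6rwald"])
first = reader.read()
second = reader.read()
```

How the leftover bytes are stored is private to the reader; callers
only see that the second read() picks up where the first one stopped.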

>>> A single object serves this best and does not create
>>> a whole plethora of new APIs in the _codecs module.
>>> This is not over-design, but serves a reason.
>>
>> It does put a burden on codec developers, who need
>> to match the "official" state representation policy.
>> Of course, if they are allowed to return a tuple
>> representing their state, that would be fine with
>> me.
> 
> They can use any object they like to keep the state
> in whatever format they choose. I think this makes it
> easier on the codec writer, rather than harder.
> 
> Furthermore, they can change the way they store state,
> e.g. to accommodate new features they may want to
> add to the codec, without breaking the interface.

That's basically the current state of the codec machinery,
so we don't have to change anything in the specification.
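
As an illustration of a state object whose format belongs to the
codec, the sketch below uses the incremental decoder interface found
in later Python versions (codecs.getincrementaldecoder() with
getstate() and setstate()). That interface is not part of the patch
discussed here; it is shown only as one possible shape of the idea.

```python
import codecs

# Look up the incremental decoder class for UTF-8 (later Python versions).
decoder_class = codecs.getincrementaldecoder("utf-8")

decoder = decoder_class()
head = decoder.decode(b"D\xc3")  # trailing byte is an incomplete sequence

# The caller treats the state as an opaque value chosen by the codec.
state = decoder.getstate()

# A second decoder instance of the same codec resumes from that state.
resumed = decoder_class()
resumed.setstate(state)
tail = resumed.decode(b"\xb6rwald")
```

The caller never inspects the state value, so the codec writer stays
free to change its layout later.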

BTW, I hope that I can add updated documentation to the
patch tomorrow (for PyUnicode_DecodeUTF8Stateful() and
friends and for the additional arguments to read()),
because I'll be on vacation next week.

Bye,
    Walter Dörwald