[Python-Dev] Decoding incomplete unicode

Wed Jul 28 18:55:10 CEST 2004

M.-A. Lemburg wrote:

> Walter Dörwald wrote:
> 
>> This is the correct thing to do for the stateless decoders:
>> any incomplete byte sequence at the end of the input is an
>> error. But then it doesn't make sense to return the number
>> of bytes decoded for the stateless decoder, because this is
>> always the size of the input. 
> 
> The reason why stateless encode and decode APIs return the
> number of input items consumed is to accommodate error
> handling situations like these where you want to stop
> coding and leave the remaining work to another step.

Which in most cases is the read method.
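
To make the point about the stateless API concrete, here is a small sketch, written for current Python 3 rather than the 2004 interpreter, so byte literals and exception details are assumptions relative to the thread. It shows the (text, consumed) return value and how a truncated trailing sequence is treated as an error.

```python
import codecs

# Stateless UTF-8 decoder, as returned by codecs.getdecoder().
decode = codecs.getdecoder("utf-8")

# Complete input: the second item is the number of bytes consumed,
# which for a successful stateless call is simply len(input).
text, consumed = decode(b"abc")

# Input ending in an incomplete sequence (the first two bytes of
# U+20AC): the stateless decoder reports the truncated tail as an error.
try:
    decode(b"a\xe2\x82")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```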

> The C implementation currently doesn't make use of this
> feature.
> 
>> For the stateful decoder this
>> is just some sort of state common to all decoders: the decoder
>> keeps the incomplete byte sequence to be used in the next call.
>> But then this state should be internal to the decoder and not
>> part of the public API. This would make the decode() method
>> more usable: When you want to implement an XML parser that
>> supports the xml.sax.xmlreader.IncrementalParser interface,
>> you have an API mismatch. The parser has to use the stateful
>> decoding API (i.e. read()), which means the input is in the
>> form of a byte stream, but this interface expects its input
>> as byte chunks passed to multiple calls to the feed() method.
>> If StreamReader.decode() simply returned the decoded unicode
>> object and kept the remaining undecoded bytes as an internal
>> state then StreamReader.decode() would be directly usable.
> 
> 
> Please don't mix "StreamReader" with "decoder". The codecs
> module returns 4 different objects if you ask it for
> a codec set: two stateless APIs for encoding and decoding
> and two factory functions for creating possibly stateful
> objects which expose a stream interface.
> 
> Your "stateful decoder" is actually part of a StreamReader
> implementation and doesn't have anything to do with the
> stateless decoder.
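
As an illustration of the four objects mentioned above, the following sketch unpacks the result of codecs.lookup() in current Python 3 (an assumption relative to the thread; the variable names are made up for the example) and exercises each one.

```python
import codecs
import io

# codecs.lookup() yields the codec set: two stateless functions and
# two factories for stream objects. "utf-8" is just an example codec.
encode, decode, reader_factory, writer_factory = codecs.lookup("utf-8")

# Stateless APIs: each returns (result, number of input items consumed).
encoded, chars_consumed = encode("\u20ac")
decoded, bytes_consumed = decode(encoded)

# Stream APIs: the factories wrap a byte stream.
reader = reader_factory(io.BytesIO(encoded))
from_stream = reader.read()

sink = io.BytesIO()
writer = writer_factory(sink)
writer.write("\u20ac")
```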

I know. I'd just like to have a stateful decoder that
doesn't use a stream interface. The stream interface
could be built on top of that without any knowledge
of the encoding.
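
A minimal sketch of what such a stateful decoder without a stream interface might look like. The class name Utf8FeedDecoder and its feed() method are hypothetical, the sketch covers UTF-8 only, and it assumes the codecs.utf_8_decode(data, errors, final) helper of current CPython, which leaves an incomplete trailing sequence unconsumed when final is false.

```python
import codecs


class Utf8FeedDecoder:
    """Hypothetical feed-style decoder that hides its state."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self._pending = b""  # undecoded bytes kept between calls

    def feed(self, chunk, final=False):
        data = self._pending + chunk
        # With final=False an incomplete trailing sequence is not an
        # error; it is simply excluded from the consumed count.
        text, consumed = codecs.utf_8_decode(data, self.errors, final)
        self._pending = data[consumed:]
        return text


decoder = Utf8FeedDecoder()
# U+20AC is b"\xe2\x82\xac" in UTF-8; split it across three chunks.
pieces = [decoder.feed(b"a\xe2"), decoder.feed(b"\x82"), decoder.feed(b"\xacb")]
```

The caller only ever sees decoded text; the leftover bytes never appear in the return value.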

I wonder whether the decode method is part of the public
API for StreamReader.

> I see two possibilities here:
> 
> 1. you write a custom StreamReader/Writer implementation
>    for each of the codecs which takes care of keeping
>    state and encoding/decoding as much as possible

But I'd like to reuse at least some of the functionality
from PyUnicode_DecodeUTF8() etc.

Would a version of PyUnicode_DecodeUTF8() with an additional
PyUTF_DecoderState * be OK?

> 2. you extend the existing stateless codec implementations
>    to allow communicating state on input and output; the
>    stateless operation would then be a special case
> 
>> But this isn't really a "StreamReader" any more, so the best
>> solution would probably be to have a three level API:
>> * A stateless decoding function (what codecs.getdecoder
>>   returns now);
>> * A stateful "feed reader", which keeps internal state
>>   (including undecoded byte sequences) and gets passed byte
>>   chunks (should this feed reader have an error attribute or
>>   should this be an argument to the feed method?);
>> * A stateful stream reader that reads its input from a
>>   byte stream. The functionality for the stream reader could
>>   be implemented once using the underlying functionality of
>>   the feed reader (in fact we could implement something similar
>>   to sio's stacking streams: the stream reader would use
>>   the feed reader to wrap the byte input stream and add
>>   only a read() method. The line reading methods (readline(),
>>   readlines() and __iter__()) could be added by another stream
>>   filter).
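
The layering in the list above can be sketched as follows. This is an illustration only: it uses codecs.getincrementaldecoder() from later Python versions as a stand-in for the proposed feed reader, and the class name FeedBackedReader is hypothetical.

```python
import codecs
import io


class FeedBackedReader:
    """Hypothetical stream reader built on a feed-style decoder.

    It knows nothing about the encoding itself; it only moves bytes
    from the stream into the decoder.
    """

    def __init__(self, stream, encoding="utf-8", errors="strict"):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)(errors)

    def read(self, size=-1):
        data = self.stream.read(size)
        # Reading everything, or hitting end of stream, is the final
        # call, so a dangling incomplete sequence becomes an error then.
        final = size < 0 or not data
        return self.decoder.decode(data, final=final)


reader = FeedBackedReader(io.BytesIO("a\u20acb".encode("utf-8")))
# Two-byte reads split the three-byte sequence for U+20AC.
chunks = [reader.read(2), reader.read(2), reader.read(2), reader.read(2)]
```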
> 
> Why make things more complicated ?
> 
> If you absolutely need a feed interface, you can feed
> your data to a StringIO instance which is then read from
> by StreamReader.

This doesn't work, because a StringIO has only one file position:
 >>> import cStringIO
 >>> s = cStringIO.StringIO()
 >>> s.write("x")
 >>> s.read()
''

But something like the Queue class from the tests in the patch
might work.
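
A rough sketch of that idea, in current Python 3 (an assumption relative to the thread). The ByteQueue class below is hypothetical and is not the Queue class from the patch; it only shows why separate write and read ends make a feed-style use of StreamReader possible.

```python
import codecs
import io

# One shared file position: after write() the position sits at the
# end of the buffer, so read() finds nothing.
single = io.BytesIO()
single.write(b"x")
leftover = single.read()


class ByteQueue:
    """Hypothetical byte queue: write at one end, read from the other."""

    def __init__(self):
        self._buffer = b""

    def write(self, data):
        self._buffer += data

    def read(self, size=-1):
        if size < 0:
            size = len(self._buffer)
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


queue = ByteQueue()
reader = codecs.getreader("utf-8")(queue)

queue.write(b"a\xe2")      # ends in the middle of U+20AC
first = reader.read()
queue.write(b"\x82\xac")   # rest of the sequence
second = reader.read()
```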

>>> The error callbacks could, however, raise an exception which
>>> includes all the needed information, including any state that
>>> may be needed in order to continue with the coding operation.
>>
>> This makes error callbacks effectively unusable with stateful
>> decoders.
> 
> Could you explain ?

If you have to call the decode function with errors='break',
you will only get the break error handling and nothing else.

>>> We may then need to allow additional keyword arguments on the
>>> encode/decode functions in order to preset a start state.
>>
>> As those decoding functions are private to the decoder that's
>> probably OK. I wouldn't want to see additional keyword arguments
>> on str.decode (which uses the stateless API anyway). BTW, that's
>> exactly what I did for codecs.utf_7_decode_stateful, but I'm not
>> really comfortable with the internal state of the UTF-7 decoder
>> being exposed on the Python level. It would be better to encapsulate
>> the state in a feed reader implemented in C, so that the state is
>> inaccessible from the Python level.
> 
> See above: possibility 1 would be the way to go then.

I might give this a try.

Bye,
    Walter Dörwald