[Python-Dev] Decoding incomplete unicode
Walter Dörwald
walter at livinglogic.de
Wed Jul 28 18:55:10 CEST 2004
M.-A. Lemburg wrote:
> Walter Dörwald wrote:
>
>> This is the correct thing to do for the stateless decoders:
>> any incomplete byte sequence at the end of the input is an
>> error. But then it doesn't make sense to return the number
>> of bytes decoded for the stateless decoder, because this is
>> always the size of the input.
>
> The reason why stateless encode and decode APIs return the
> number of input items consumed is to accommodate error
> handling situations like these where you want to stop
> coding and leave the remaining work to another step.
Which in most cases is the read method.
> The C implementation currently doesn't make use of this
> feature.
>
>> For the stateful decoder this
>> is just some sort of state common to all decoders: the decoder
>> keeps the incomplete byte sequence to be used in the next call.
>> But then this state should be internal to the decoder and not
>> part of the public API. This would make the decode() method
>> more usable: When you want to implement an XML parser that
>> supports the xml.sax.xmlreader.IncrementalParser interface,
>> you have an API mismatch. The parser has to use the stateful
>> decoding API (i.e. read()), which means the input is in the
>> form of a byte stream, but this interface expects its input
>> as byte chunks passed to multiple calls to the feed() method.
>> If StreamReader.decode() simply returned the decoded unicode
>> object and kept the remaining undecoded bytes as an internal
>> state then StreamReader.decode() would be directly usable.
>
>
> Please don't mix "StreamReader" with "decoder". The codecs
> module returns 4 different objects if you ask it for
> a codec set: two stateless APIs for encoding and decoding
> and two factory functions for creating possibly stateful
> objects which expose a stream interface.
>
> Your "stateful decoder" is actually part of a StreamReader
> implementation and doesn't have anything to do with the
> stateless decoder.
I know. I'd just like to have a stateful decoder that
doesn't use a stream interface. The stream interface
could be built on top of that without any knowledge
of the encoding.
I wonder whether the decode method is part of the public
API for StreamReader.
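A rough sketch of such a stateful decoder without a stream interface, with
a stream reader layered on top of it. This is only an illustration under
stated assumptions: FeedDecoder, FeedStreamReader and utf8_partial are
hypothetical names, not part of the codecs module, and the sketch relies on
codecs.utf_8_decode in a current CPython accepting a "final" flag and
reporting how many bytes it consumed.

```python
import codecs
import io


class FeedDecoder:
    """Hypothetical feed-style decoder; not an existing codecs API."""

    def __init__(self, decode, errors="strict"):
        # decode(data, errors) must return (text, bytes_consumed).
        self._decode = decode
        self._errors = errors
        self._pending = b""  # incomplete trailing bytes, kept internal

    def feed(self, chunk):
        data = self._pending + chunk
        text, consumed = self._decode(data, self._errors)
        # Whatever was not consumed is retried together with the next chunk.
        self._pending = data[consumed:]
        return text


class FeedStreamReader:
    """Hypothetical stream layer; it knows nothing about the encoding."""

    def __init__(self, stream, feed_decoder):
        self._stream = stream
        self._decoder = feed_decoder

    def read(self, size=-1):
        return self._decoder.feed(self._stream.read(size))


def utf8_partial(data, errors):
    # final=False: stop before an incomplete trailing sequence
    # instead of treating it as an error.
    return codecs.utf_8_decode(data, errors, False)


# Feed interface: chunk boundaries fall in the middle of characters.
decoder = FeedDecoder(utf8_partial)
chunks = [b"D\xc3", b"\xb6rwald \xe2", b"\x82", b"\xac"]
pieces = [decoder.feed(chunk) for chunk in chunks]

# Stream interface built on the same feed decoder.
reader = FeedStreamReader(io.BytesIO("Dörwald".encode("utf-8")),
                          FeedDecoder(utf8_partial))
head = reader.read(2)   # the two bytes end in the middle of a character
tail = reader.read()
```

The undecoded bytes never leave the FeedDecoder instance, so the caller
only ever sees decoded text, whether it feeds chunks directly or reads
through the stream layer.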
> I see two possibilities here:
>
> 1. you write a custom StreamReader/Writer implementation
> for each of the codecs which takes care of keeping
> state and encoding/decoding as much as possible
But I'd like to reuse at least some of the functionality
from PyUnicode_DecodeUTF8() etc.
Would a version of PyUnicode_DecodeUTF8() with an additional
PyUTF_DecoderState * be OK?
> 2. you extend the existing stateless codec implementations
> to allow communicating state on input and output; the
> stateless operation would then be a special case
>
>> But this isn't really a "StreamReader" any more, so the best
>> solution would probably be to have a three level API:
>> * A stateless decoding function (what codecs.getdecoder
>> returns now);
>> * A stateful "feed reader", which keeps internal state
>> (including undecoded byte sequences) and gets passed byte
>> chunks (should this feed reader have an error attribute or
>> should this be an argument to the feed method?);
>> * A stateful stream reader that reads its input from a
>> byte stream. The functionality for the stream reader could
>> be implemented once using the underlying functionality of
>> the feed reader (in fact we could implement something similar
>> to sio's stacking streams: the stream reader would use
>> the feed reader to wrap the byte input stream and add
>> only a read() method. The line reading methods (readline(),
>> readlines() and __iter__()) could be added by another stream
>> filter).
>
> Why make things more complicated ?
>
> If you absolutely need a feed interface, you can feed
> your data to a StringIO instance which is then read from
> by StreamReader.
This doesn't work, because a StringIO has only one file position:
>>> import cStringIO
>>> s = cStringIO.StringIO()
>>> s.write("x")
>>> s.read()
''
But something like the Queue class from the tests in the patch
might work.
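To illustrate the difference, here is a sketch of a queue-like byte buffer
with separate write and read ends, next to the single-position behaviour
shown above. ByteQueue is a hypothetical stand-in written for this
example, not the Queue class from the patch, and the sketch assumes a
current CPython in which the UTF-8 StreamReader keeps undecoded bytes
between read() calls.

```python
import codecs
import io


class ByteQueue:
    """Hypothetical FIFO byte buffer: write() appends, read() consumes."""

    def __init__(self):
        self._buffer = b""

    def write(self, data):
        self._buffer += data

    def read(self, size=-1):
        if size < 0:
            data, self._buffer = self._buffer, b""
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


# A BytesIO shares one position between writing and reading,
# so reading right after a write finds nothing.
s = io.BytesIO()
s.write(b"x")
bytesio_result = s.read()

# The queue hands newly written bytes to the next read().
q = ByteQueue()
reader = codecs.getreader("utf-8")(q)
q.write(b"\xc3")          # first byte of a two-byte character
first = reader.read()
q.write(b"\xb6")          # second byte completes it
second = reader.read()
```

With the queue, the reader returns no text for the lone lead byte and
delivers the character once the second byte has been written.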
>>> The error callbacks could, however, raise an exception which
>>> includes all the needed information, including any state that
>>> may be needed in order to continue with coding operation.
>>
>> This makes error callbacks effectively unusable with stateful
>> decoders.
>
> Could you explain ?
If you have to call the decode function with errors='break',
you will only get the break error handling and nothing else.
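A small sketch of the problem, assuming a hypothetical "break" handler
registered through codecs.register_error. The handler name and its
behaviour (simply re-raising) are assumptions made for this example.
Because only one errors value can be passed, choosing "break" also gives
up "replace" for bytes that are invalid rather than merely incomplete.

```python
import codecs


def break_handler(exc):
    # Hypothetical "break" handling: stop decoding by re-raising.
    raise exc


codecs.register_error("break", break_handler)

# \xff is invalid UTF-8; the trailing \xc3 is only incomplete.
data = b"a\xffb\xc3"

try:
    data.decode("utf-8", "break")
    stopped_at = None
except UnicodeDecodeError as exc:
    # Decoding stops at the first problem, which is the invalid byte,
    # not the incomplete sequence at the end.
    stopped_at = exc.start

# With "replace" the invalid byte is handled, but so is the incomplete
# tail, which a stateful decoder would rather keep for the next call.
replaced = data.decode("utf-8", "replace")
```

Neither call gives the combination one would want here: replace the
invalid byte and hold back only the incomplete tail.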
>>> We may then need to allow additional keyword arguments on the
>>> encode/decode functions in order to preset a start state.
>>
>> As those decoding functions are private to the decoder that's
>> probably OK. I wouldn't want to see additional keyword arguments
>> on str.decode (which uses the stateless API anyway). BTW, that's
>> exactly what I did for codecs.utf_7_decode_stateful, but I'm not
>> really comfortable with the internal state of the UTF-7 decoder
>> being exposed on the Python level. It would be better to encapsulate
>> the state in a feed reader implemented in C, so that the state is
>> inaccessible from the Python level.
>
> See above: possibility 1 would be the way to go then.
I might give this a try.
Bye,
Walter Dörwald