[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Thu Aug 19 17:45:26 CEST 2004


Martin v. Löwis wrote:

> Walter Dörwald wrote:
> 
>> They will not, because StreamReader.decode() already is a feed
>> style API (but with state amnesia).
>>
>> Any stream decoder that I can think of can be (and most are)
>> implemented by overriding decode().
> 
> I consider that an unfortunate implementation artefact. You
> either use the stateless encode/decode that you get from
> codecs.get(encoder/decoder) or you use the file API on
> the streams. You never ever use encode/decode on streams.

That is exactly the problem with the current API.
StreamReader mixes two concepts:

1) The stateful API, which allows decoding a byte input
    in chunks, with the state of the decoder kept between
    calls.
2) A file API where the chunks to be decoded are read
    from a byte stream.
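
To illustrate concept 1 on its own, here is a minimal sketch. It
assumes a later Python (3.x), where codecs.getincrementaldecoder()
exists; that function is not part of the API discussed in this
thread, and the sample string is only an illustration.

```python
import codecs

# Assumption: Python 3, where an incremental decoder is available.
decoder = codecs.getincrementaldecoder("utf-8")()

data = "äöü".encode("utf-8")                      # six bytes, two per character
chunks = [data[i:i + 1] for i in range(len(data))]

# Feed one byte at a time; the decoder remembers incomplete sequences
# between calls and returns "" until a character is complete.
pieces = [decoder.decode(chunk) for chunk in chunks]

# Signal the end of input so that leftover bytes would be reported.
result = "".join(pieces) + decoder.decode(b"", final=True)
```

No byte stream is involved here; the caller supplies the chunks,
which is the separation from concept 2.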

> I would have preferred if the default .write implementation
> would have called self._internal_encode, and the Writer
> would *contain* a Codec, rather than inheriting from Codec.

This would separate the two concepts from above.
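
A rough sketch of such a containing writer, assuming Python 3 and
codecs.getincrementalencoder(). The class name ContainingWriter is
hypothetical, and _internal_encode is taken from the suggestion
above rather than from any existing API.

```python
import codecs
import io


class ContainingWriter:
    """Hypothetical writer that *contains* an encoder instead of
    inheriting from Codec."""

    def __init__(self, stream, encoding):
        self.stream = stream
        # The encoder is held as a member, so the file API and the
        # encoding API stay separate.
        self._encoder = codecs.getincrementalencoder(encoding)()

    def _internal_encode(self, text):
        return self._encoder.encode(text)

    def write(self, text):
        self.stream.write(self._internal_encode(text))


buf = io.BytesIO()
writer = ContainingWriter(buf, "utf-8")
writer.write("ä")
written = buf.getvalue()
```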

> Alas, for (I guess) simplicity, a more direct (and more
> confusing) approach was taken.
> 
>> 1) Having feed() as part of the StreamReader API:
>> ---
>> s = u"???".encode("utf-8")
>> r = codecs.getreader("utf-8")()
>> for c in s:
>>    print r.feed(c)
> 
> 
> Isn't that a totally unrelated issue? Aren't we talking about
> short reads on sockets etc?

We're talking about two problems:

1) The current implementation does not really support the
    stateful API, because trailing incomplete byte sequences
    lead to errors.
2) The current file API is not really convenient for decoding
    when the input is not read from a stream.
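
Both problems can be sketched as follows, assuming Python 3. The
first part uses the stateless bytes.decode() as a stand-in for a
decoder without memory, so it only approximates the behaviour of
StreamReader.decode() described above. The second part shows the
wrapper needed to use the file API on data that is already in
memory.

```python
import codecs
import io

data = "é".encode("utf-8")        # two bytes: b"\xc3\xa9"

# Problem 1: a chunk that ends in the middle of a character is
# rejected instead of being kept for the next call.
try:
    data[:1].decode("utf-8")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

# Problem 2: to use the file API on bytes that did not come from a
# stream, they first have to be wrapped in a stream object.
reader = codecs.getreader("utf-8")(io.BytesIO(data))
text = reader.read()
```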

> I would very much prefer to solve one problem at a time.

Bye,
    Walter Dörwald



