[Python-Dev] Decoding incomplete unicode

Wed Aug 18 22:35:31 CEST 2004

M.-A. Lemburg wrote:

> Walter Dörwald wrote:
> 
>>> I've thought about this some more. Perhaps I'm still missing
>>> something, but wouldn't it be possible to add a feeding
>>> mode to the existing stream codecs by creating a new queue
>>> data type (much like the queue you have in the test cases of
>>> your patch) and using the stream codecs on these ?
>>
>>
>> No, because when the decode method encounters an incomplete
>> chunk (and so returns a size that is smaller than the size of
>> the input), read() would have to push the remaining bytes back
>> into the queue. This would be code similar in functionality
>> to the feed() method from the patch, with the difference that
>> the buffer lives in the queue not the StreamReader. So
>> we won't gain any code simplification by going this route.
> 
> Maybe not code simplification, but the APIs will be well-
> separated.

They will not, because StreamReader.decode() is already a
feed-style API (but with state amnesia).

Any stream decoder that I can think of can be (and most are)
implemented by overriding decode().
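
The "state amnesia" point can be illustrated with a minimal sketch. It is written in present-day Python 3 rather than the Python 2 used in this thread, and it assumes CPython's codecs.utf_8_decode(), which takes a final flag and returns the decoded text together with the number of bytes consumed. The sample string is made up for illustration.

```python
import codecs

data = "\u00e9".encode("utf-8")      # one character, two bytes: b'\xc3\xa9'
first, second = data[:1], data[1:]

# Stateless decode of only the first byte. final=False means an incomplete
# trailing sequence is left undecoded instead of raising an error.
text, consumed = codecs.utf_8_decode(first, "strict", False)

# Nothing was consumed, and the decoder remembers nothing about the byte
# it has seen. The caller has to carry the leftover forward itself.
pending = first[consumed:] + second
text2, consumed2 = codecs.utf_8_decode(pending, "strict", False)
```

Whoever calls decode() in this way has to own the leftover bytes, and the rest of the thread is about where that buffer should live.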

> If we require the queue type for feeding mode operation
> we are free to define whatever APIs are needed to communicate
> between the codec and the queue type, e.g. we could define
> a method that pushes a few bytes back onto the queue end
> (much like ungetc() in C).

That would of course be a possibility.
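
As a rough sketch of what such a pushback method might look like (Python 3, with a hypothetical StreamQueue class and a hypothetical unread() method; neither name comes from the patch, and codecs.utf_8_decode() with its final flag is assumed as CPython provides it):

```python
import codecs

class StreamQueue:
    """Hypothetical byte queue with an ungetc()-style pushback."""

    def __init__(self):
        self._buffer = b""

    def write(self, data):
        # Append new bytes at the end of the queue.
        self._buffer += data

    def read(self, size=-1):
        # Remove and return up to size bytes (everything if size < 0).
        if size < 0:
            size = len(self._buffer)
        chunk, self._buffer = self._buffer[:size], self._buffer[size:]
        return chunk

    def unread(self, data):
        # Push bytes back so that the next read() sees them first.
        self._buffer = data + self._buffer

q = StreamQueue()
q.write(b"\xc3")                      # first half of a two-byte sequence
chunk = q.read()
text, consumed = codecs.utf_8_decode(chunk, "strict", False)
q.unread(chunk[consumed:])            # nothing decoded, push the byte back
q.write(b"\xa9")                      # second half arrives
result, _ = codecs.utf_8_decode(q.read(), "strict", False)
```

The leftover handling is the same as with a buffer inside the reader; only the place where the bytes are stored differs.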

>>> I think such a queue would be generally useful in other
>>> contexts as well, e.g. for implementing fast character based
>>> pipes between threads, non-Unicode feeding parsers, etc.
>>> Using such a type you could potentially add a feeding
>>> mode to stream or file-object API based algorithms very
>>> easily.
>>
>> Yes, so we could put this Queue class into a module with
>> string utilities. Maybe string.py?
> 
> Hmm, I think a separate module would be better since we
> could then recode the implementation in C at some point
> (and after the API has settled).
> We'd only need a new name for it, e.g. StreamQueue or
> something.

Sounds reasonable.

>>> We could then have a new class, e.g. FeedReader, which
>>> wraps the above in a nice API, much like StreamReaderWriter
>>> and StreamRecoder.
>>
>> But why should we, when decode() does most of what we need,
>> and the rest has to be implemented in both versions?
> 
> To hide the details from the user. It should be possible
> to instantiate one of these StreamQueueReaders (named
> after the queue type) and simply use it in feeding
> mode without having to bother about the details behind
> the implementation.
> 
> StreamReaderWriter and StreamRecoder exist for the same
> reason.

Let's compare example uses:

1) Having feed() as part of the StreamReader API:
---
import codecs

s = u"???".encode("utf-8")
r = codecs.getreader("utf-8")()
for c in s:
    print r.feed(c)
---
2) Explicitly using a queue object:
---
import codecs
from whatever import StreamQueue

s = u"???".encode("utf-8")
q = StreamQueue()
r = codecs.getreader("utf-8")(q)
for c in s:
    q.write(c)
    print r.read()
---
3) Using a special wrapper that implicitly creates a queue:
---
import codecs
from whatever import StreamQueueWrapper

s = u"???".encode("utf-8")
r = StreamQueueWrapper(codecs.getreader("utf-8"))
for c in s:
    print r.feed(c)
---

I very much prefer option 1).
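
For illustration, the behaviour behind option 1) can be sketched in a few lines of Python 3. This is not the code from the patch: FeedingUTF8Reader is a hypothetical standalone class, the sample text is made up, and codecs.utf_8_decode() with its final flag is assumed as CPython provides it.

```python
import codecs

class FeedingUTF8Reader:
    """Hypothetical reader that keeps undecoded bytes between calls."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self._pending = b""           # bytes of an incomplete sequence

    def feed(self, data):
        # Prepend whatever was left over from the previous call.
        data = self._pending + data
        # final=False leaves an incomplete trailing sequence undecoded.
        text, consumed = codecs.utf_8_decode(data, self.errors, False)
        # Remember the undecoded tail for the next call.
        self._pending = data[consumed:]
        return text

reader = FeedingUTF8Reader()
encoded = "\u00e4\u20ac".encode("utf-8")   # 2-byte and 3-byte sequences
# Feed one byte at a time, as in the loops above.
pieces = [reader.feed(encoded[i:i + 1]) for i in range(len(encoded))]
```

Each call returns an empty string until a character is complete, so feeding byte by byte yields the same text as decoding everything at once.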

"If the implementation is hard to explain, it's a bad idea."

Bye,
    Walter Dörwald