[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Thu Aug 19 20:09:17 CEST 2004


M.-A. Lemburg wrote:

> Walter Dörwald wrote:
> 
>> Let's compare example uses:
>>
>> 1) Having feed() as part of the StreamReader API:
>> ---
>> s = u"???".encode("utf-8")
>> r = codecs.getreader("utf-8")()
>> for c in s:
>>    print r.feed(c)
>> ---
> 
> I consider adding a .feed() method to the stream codec
> bad design. .feed() is something you do on a stream, not
> a codec.

I don't care about the name; we can call it
stateful_decode_byte_chunk() or whatever. (In fact I'd
prefer to call it decode(), but that name is already
taken by another method. Of course we could always
rename decode() to _internal_decode() like Martin
suggested.)
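
To make concrete what such a stateful chunk decoder does, here is a
minimal sketch. It assumes the incremental decoder interface of the
codecs module (codecs.getincrementaldecoder) as a stand-in for the
method discussed here, which is not any of the APIs proposed in this
thread, and the sample text is made up for illustration:

```python
import codecs

# Assumption: an incremental decoder stands in for the proposed
# feed() / stateful_decode_byte_chunk() method.
decoder = codecs.getincrementaldecoder("utf-8")()

# Made-up sample text; each character encodes to two bytes in UTF-8.
data = u"\xe4\xf6\xfc".encode("utf-8")

chunks = []
for i in range(len(data)):
    # Feed one byte at a time. An incomplete sequence is kept inside
    # the decoder and yields an empty string until it is complete.
    chunks.append(decoder.decode(data[i:i + 1]))

result = u"".join(chunks)
```

Every other call returns an empty string, because the first byte of
each two-byte sequence cannot be decoded on its own.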

>> 2) Explicitly using a queue object:
>> ---
>> from whatever import StreamQueue
>>
>> s = u"???".encode("utf-8")
>> q = StreamQueue()
>> r = codecs.getreader("utf-8")(q)
>> for c in s:
>>    q.write(c)
>>    print r.read()
>> ---
> 
> This is probably how an advanced codec writer would use the APIs
> to build new stream interfaces.
>
>> 3) Using a special wrapper that implicitly creates a queue:
>> ----
>> from whatever import StreamQueueWrapper
>> s = u"???".encode("utf-8")
>> r = StreamQueueWrapper(codecs.getreader("utf-8"))
>> for c in s:
>>    print r.feed(c)
>> ----
> 
> 
> This could be turned into something more straightforward,
> e.g.
> 
> from codecs import EncodedStream
> 
> # Load data
> s = u"???".encode("utf-8")
> 
> # Write to encoded stream (one byte at a time) and print
> # the read output
> q = EncodedStream(input_encoding="utf-8", output_encoding="unicode")

This is confusing, because there is no encoding named "unicode".
This should probably read:

q = EncodedQueue(encoding="utf-8", errors="strict")

> for c in s:
>    q.write(c)
>    print q.read()
> 
> # Make sure we have processed all data:
> if q.has_pending_data():
>    raise ValueError, 'data truncated'

This should be the job of the error callback, so the last part should
probably be:

for c in s:
    q.write(c)
    print q.read()
print q.read(final=True)
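
A rough sketch of how such a queue could behave follows. The
EncodedQueue class is hypothetical (it is the name suggested above, not
an existing class), and the sketch assumes the codecs module's
incremental decoder interface to do the buffering; the point is only
that read(final=True) lets the error handler report truncated input:

```python
import codecs


class EncodedQueue(object):
    """Hypothetical queue: write() takes bytes, read() returns text."""

    def __init__(self, encoding, errors="strict"):
        # Assumption: an incremental decoder keeps incomplete input.
        self._decoder = codecs.getincrementaldecoder(encoding)(errors)
        self._pending = []  # byte chunks written but not yet read

    def write(self, data):
        self._pending.append(data)

    def read(self, final=False):
        data = b"".join(self._pending)
        self._pending = []
        # With final=True, leftover bytes are passed to the error
        # handling selected by `errors` instead of being kept.
        return self._decoder.decode(data, final)


q = EncodedQueue(encoding="utf-8", errors="strict")
q.write(b"\xc3")        # first half of a two-byte sequence
first = q.read()        # nothing can be decoded yet
try:
    q.read(final=True)  # the input ends in the middle of a character
except UnicodeDecodeError:
    truncated = True
else:
    truncated = False
```

With errors="strict" the truncated input raises UnicodeDecodeError, so
no separate has_pending_data() check is needed.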

>> I very much prefer option 1).
> 
> I prefer the above example because it's easy to read and
> makes things explicit.
> 
>> "If the implementation is hard to explain, it's a bad idea."
> 
> The user usually doesn't care about the implementation, only its
> interfaces.

Bye,
    Walter Dörwald



