[Python-Dev] Decoding incomplete unicode

Thu Aug 19 11:34:19 CEST 2004

Walter Dörwald wrote:
> Martin v. Löwis wrote:
> 
>> M.-A. Lemburg wrote:
>>
>>> I've thought about this some more. Perhaps I'm still missing
>>> something, but wouldn't it be possible to add a feeding
>>> mode to the existing stream codecs by creating a new queue
>>> data type (much like the queue you have in the test cases of
>>> your patch) and using the stream codecs on these ?
>>
>>
>> Here is the problem. In UTF-8, how does the actual algorithm
>> tell (the application) that the bytes it got on decoding provide
>> for three fully decodable characters, and that 2 bytes are left
>> undecoded, and that those bytes are not inherently ill-formed,
>> but lack a third byte to complete the multi-byte sequence?
>>
>> On top of that, you can implement whatever queuing or streaming
>> APIs you want, but you *need* an efficient way to communicate
>> incompleteness.
> 
> 
> We already have an efficient way to communicate incompleteness:
> the decode method returns the number of decoded bytes.
> 
> The questions remaining are
> 
> 1) communicate to whom? IMHO the info should only be used
>    internally by the StreamReader.

Handling incompleteness should be something for the codec
to deal with. The queue doesn't have to know about it in any
way. However, the queue should have interfaces allowing the
codec to tell whether there are more bytes waiting to be
processed.

> 2) When is incompleteness OK? Incompleteness is of course
>    not OK in the stateless API. For the stateful API,
>    incompleteness has to be OK even when the input stream
>    is (temporarily) exhausted, because otherwise a feed mode
>    wouldn't work anyway. But then incompleteness is always OK,
>    because the StreamReader can't distinguish a temporarily
>    exhausted input stream from a permanently exhausted one.
>    The only fix for this I can think of is the final argument.

A final argument may be the way to go. But it should be an
argument for the .read() method (not only the .decode() method)
since that's the method reading the data from the queue.
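
To illustrate what a final flag buys us, here is a minimal
sketch. It uses codecs.getincrementaldecoder() from present-day
Python, which is an assumption for illustration only and not
the API under discussion in this thread. A truncated UTF-8
sequence is tolerated in non-final mode and rejected in final
mode:

```python
import codecs

# The euro sign takes three bytes in UTF-8 (e2 82 ac).
data = "ab€".encode("utf-8")
head, tail = data[:-1], data[-1:]   # split inside the euro sign

# Non-final mode: the incomplete trailing sequence is kept by the
# decoder and completed by the next chunk.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(head, final=False)
second = decoder.decode(tail, final=False)

# Final mode: the same truncated input is reported as an error.
strict = codecs.getincrementaldecoder("utf-8")()
try:
    strict.decode(head, final=True)
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```

Without the flag, the decoder cannot tell these two situations
apart, which is exactly the problem described in 2) above.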

I'd suggest that we extend the existing encode and decode
codec APIs to take an extra state argument that holds the
codec state in whatever format the codec needs (e.g. this
could be a tuple or a special object):

encode(data, errors='strict', state=None)
decode(data, errors='strict', state=None)

In the case of the .read() method, decode() would be
called. If the returned length_consumed does not match
the length of the data input, then in non-final mode the
remaining items would have to be placed back onto the
queue; in final mode an exception would be raised to
signal the problem.
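
The read logic described above could be sketched roughly as
follows. ByteQueue, unread() and read_text() are hypothetical
names made up for this example, and codecs.utf_8_decode() is an
undocumented CPython helper in present-day Python that returns
(text, length_consumed); none of this is meant as the final
design:

```python
import codecs


class ByteQueue:
    """Hypothetical FIFO byte buffer standing in for the queue type."""

    def __init__(self):
        self._buf = b""

    def write(self, data):
        self._buf += data

    def read(self):
        data, self._buf = self._buf, b""
        return data

    def unread(self, data):
        # Put undecoded bytes back at the front of the queue.
        self._buf = data + self._buf


def read_text(queue, final=False):
    """Decode everything currently queued; handle leftover bytes."""
    data = queue.read()
    # Third argument False: an incomplete trailing sequence is not an
    # error, it is simply left out of length_consumed.
    text, consumed = codecs.utf_8_decode(data, "strict", False)
    rest = data[consumed:]
    if rest:
        if final:
            raise UnicodeDecodeError(
                "utf-8", data, consumed, len(data), "unexpected end of data"
            )
        queue.unread(rest)
    return text


q = ByteQueue()
q.write(b"ab\xe2\x82")      # euro sign is missing its third byte
part1 = read_text(q)        # the two leftover bytes go back on the queue
q.write(b"\xac")
part2 = read_text(q)        # the sequence is now complete

q.write(b"\xe2\x82")
try:
    read_text(q, final=True)
    final_raised = False
except UnicodeDecodeError:
    final_raised = True
```

Note that the queue itself never inspects the bytes; only the
codec decides how much of the input was decodable.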

I think it's PEP time for this new extension. If time
permits I'll craft an initial version over the weekend.

-- 
Marc-Andre Lemburg
eGenix.com
