[Python-Dev] Decoding incomplete unicode

M.-A. Lemburg mal at egenix.com
Wed Aug 18 10:36:06 CEST 2004


Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
> 
>> I've thought about this some more. Perhaps I'm still missing
>> something, but wouldn't it be possible to add a feeding
>> mode to the existing stream codecs by creating a new queue
>> data type (much like the queue you have in the test cases of
>> your patch) and using the stream codecs on these?
> 
> Here is the problem. In UTF-8, how does the actual algorithm
> tell (the application) that the bytes it got on decoding provide
> for three fully decodable characters, and that 2 bytes are left
> undecoded, and that those bytes are not inherently ill-formed,
> but lack a third byte to complete the multi-byte sequence?

This state can be stored in the stream codec instance,
either by keeping a special state object in the instance
and passing it to the encode/decode APIs of the codec,
or by implementing the stream codec itself in C.
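
As a minimal sketch of state kept in the codec instance: the
incremental decoders in the standard library, which were added to
Python after this message was written, hold the undecoded tail
between calls. The input bytes below are made up for illustration.

```python
import codecs

# The decoder instance carries the pending bytes between calls.
decoder = codecs.getincrementaldecoder("utf-8")()

# Three complete characters plus the first two bytes of a
# three-byte sequence (the euro sign is b"\xe2\x82\xac").
first = decoder.decode(b"abc\xe2\x82")

# getstate() exposes the buffered, not yet decoded input.
pending, _flags = decoder.getstate()

# Feeding the missing byte completes the sequence.
second = decoder.decode(b"\xac")

print(repr(first), pending, repr(second))
```

The caller never handles the leftover bytes itself; they stay
inside the decoder object until enough input has arrived.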

We do need to extend the API between the stream codec
and the encode/decode functions, no doubt about that.
However, this is an extension that is well hidden from
the user of the codec and won't break code.
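
From the user's side, such a feeding mode could look roughly like
the sketch below. ByteQueue is a hypothetical stand-in for the queue
type mentioned above, not an existing class. The reader is the stock
UTF-8 StreamReader obtained via codecs.getreader, which in current
Python versions buffers an incomplete tail on its own.

```python
import codecs

class ByteQueue:
    """Hypothetical in-memory queue: write bytes in, read them out."""

    def __init__(self):
        self._buf = b""

    def write(self, data):
        self._buf += data

    def read(self, size=-1):
        # Return up to `size` bytes, or everything if size is negative.
        if size < 0:
            size = len(self._buf)
        chunk, self._buf = self._buf[:size], self._buf[size:]
        return chunk

queue = ByteQueue()
reader = codecs.getreader("utf-8")(queue)

# Feed an incomplete chunk: only the complete characters come out.
queue.write(b"abc\xe2\x82")
part1 = reader.read()

# Feed the missing byte: the held-back sequence is now decoded.
queue.write(b"\xac")
part2 = reader.read()

print(repr(part1), repr(part2))
```

The user code only calls write() and read(); how the reader tracks
the incomplete sequence stays an internal detail of the codec.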

> On top of that, you can implement whatever queuing or streaming
> APIs you want, but you *need* an efficient way to communicate
> incompleteness.

Agreed.
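
One way to communicate incompleteness, sketched under the assumption
of CPython's codecs.utf_8_decode(data, errors, final) helper: with
final set to false, a truncated tail is simply left unconsumed, while
genuinely ill-formed input still raises. The wrapper name
decode_prefix is made up for this example.

```python
import codecs

def decode_prefix(data):
    """Decode as much as possible; return (text, unconsumed tail)."""
    # final=False: a truncated trailing sequence is not an error.
    text, consumed = codecs.utf_8_decode(data, "strict", False)
    return text, data[consumed:]

# Incomplete: three characters decoded, two bytes await a third.
text, tail = decode_prefix(b"abc\xe2\x82")

# Ill-formed: 0x28 cannot continue the sequence started by 0xe2,
# so no amount of further input would make this valid.
try:
    decode_prefix(b"abc\xe2\x28")
    ill_formed = False
except UnicodeDecodeError:
    ill_formed = True

print(repr(text), tail, ill_formed)
```

Returning the consumed count alongside the text costs nothing extra
and lets any queuing layer decide what to do with the remainder.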

-- 
Marc-Andre Lemburg
eGenix.com


