[Python-Dev] Decoding incomplete unicode

Wed Jul 28 17:13:17 CEST 2004

Hye-Shik Chang wrote:

> [...]
>>BTW, how do you solve the problem that incomplete byte sequences
>>are retained in the middle of a stream, but should generate errors
>>at the end?
> 
> Rough pseudo code here: (it's written in C in CJKCodecs)
> 
> class StreamReader:
> 
>     pending = '' # incomplete trailing bytes kept from the previous read
> 
>     def read(self, size=-1):
>         while True:
>             r = fp.read(size)
>             if self.pending:
>                 r = self.pending + r
>                 self.pending = ''
> 
>             if r:
>                 try:
>                     outputbuffer = r.decode('utf-8')
>                 except MBERR_TOOFEW: # incomplete multibyte sequence
>                     pass
>                 except MBERR_ILLSEQ: # illegal sequence
>                     raise UnicodeDecodeError, "illseq"
> 
>             if not r or size == -1: # end of the stream
>                 if r has not been fully consumed for the output:
>                     raise UnicodeDecodeError, "toofew"

Here's the problem: I'd like the streamreader to be able
to continue even when there is no input available *now*.
Perhaps there should be an additional argument to read()
named final? If final is true, the stream reader makes
sure that all pending bytes have been used up.
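
A minimal sketch of what such a final argument could look like, written
against the incremental decoder API in today's standard library (codecs.getincrementaldecoder)
rather than anything that existed when this was posted. The class name
FinalAwareReader is hypothetical and only there for illustration.

```python
import codecs
import io


class FinalAwareReader:
    """Hypothetical reader: keeps incomplete bytes unless final is true."""

    def __init__(self, stream, encoding="utf-8"):
        self.stream = stream
        # The incremental decoder buffers an incomplete trailing sequence
        # internally, so no explicit "pending" attribute is needed here.
        self.decoder = codecs.getincrementaldecoder(encoding)()

    def read(self, size=-1, final=False):
        data = self.stream.read(size)
        # final=False: incomplete trailing bytes are retained for later.
        # final=True: leftover incomplete bytes raise UnicodeDecodeError.
        return self.decoder.decode(data, final=final)


# The three bytes of U+20AC arrive split across two reads.
reader = FinalAwareReader(io.BytesIO(b"\xe2\x82\xac"))
first = reader.read(2)    # nothing decodable yet, no error
second = reader.read(2)   # the last byte completes the character

# A stream that really ends in the middle of a character.
truncated = FinalAwareReader(io.BytesIO(b"\xe2\x82"))
truncated.read(2)
try:
    truncated.read(2, final=True)
    raised = False
except UnicodeDecodeError:
    raised = True
```

With final left false, the caller can come back later when more input
is available; only the caller knows when the stream is really finished.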

>             if r has not been fully consumed for the output:
>                 self.pending = remainders of r
> 
>             if (size == -1 or               # one time read up
>                 len(outputbuffer) > 0 or    # output buffer isn't empty
>                 original length of r == 0): # the end of the stream
>                     break
> 
>             size = 1 # read 1 byte in next try
> 
>         return outputbuffer
> 
> 
> CJKCodecs' multibytecodec structure has distinct internal error
> codes for "illegal sequence" and "incomplete sequence".  Each
> internal codec also receives a flag that indicates whether an
> immediate flush is needed at that time (at the end of a stream and
> in the simple decode functions).
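
The distinction between the two error kinds can be illustrated with a
small sketch. This is not the CJKCodecs C code; it assumes the standard
library's incremental UTF-8 decoder, and the helper name classify is
hypothetical.

```python
import codecs


def classify(data, encoding="utf-8"):
    """Return 'complete', 'incomplete' or 'illegal' for a byte string."""
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        # Without the final flag, an incomplete tail is buffered silently,
        # but an illegal sequence is reported immediately.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return "illegal"
    try:
        # Requesting a flush turns buffered incomplete bytes into an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return "incomplete"
    return "complete"


whole = classify(b"\xe2\x82\xac")   # all three bytes of U+20AC
partial = classify(b"\xe2\x82")     # first two bytes only
bad = classify(b"\xff")             # never valid in UTF-8
```

Here the flush request plays the role of the flag described above: the
same input is acceptable in the middle of a stream and an error at its end.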

Bye,
    Walter Dörwald