[Python-Dev] Decoding incomplete unicode
Walter Dörwald
walter at livinglogic.de
Wed Jul 28 17:13:17 CEST 2004
Hye-Shik Chang wrote:
> [...]
>>BTW, how do you solve the problem that incomplete byte sequences
>>are retained in the middle of a stream, but should generate errors
>>at the end?
>
> Rough pseudo code here: (it's written in C in CJKCodecs)
>
> class StreamReader:
>
>     pending = '' # incomplete
>
>     def read(self, size=-1):
>         while True:
>             r = fp.read(size)
>             if self.pending:
>                 r = self.pending + r
>                 self.pending = ''
>
>             if r:
>                 try:
>                     outputbuffer = r.decode('utf-8')
>                 except MBERR_TOOFEW: # incomplete multibyte sequence
>                     pass
>                 except MBERR_ILLSEQ: # illegal sequence
>                     raise UnicodeDecodeError, "illseq"
>
>             if not r or size == -1: # end of the stream
>                 if r have not consumed up for the output:
>                     raise UnicodeDecodeError, "toofew"
Here's the problem: I'd like the streamreader to be able
to continue even when there is no input available *now*.
Perhaps there should be an additional argument to read()
named final? If final is true, the stream reader makes
sure that all pending bytes have been used up.
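
A minimal sketch of the final idea, written against present-day Python
rather than the codec API under discussion. The class name
FinalAwareReader is hypothetical, and the pending-byte bookkeeping is
delegated to the standard library's incremental decoder
(codecs.getincrementaldecoder), whose decode() accepts a final flag:

```python
import codecs


class FinalAwareReader:
    """Hypothetical reader whose read() takes a `final` flag."""

    def __init__(self, stream, encoding="utf-8", errors="strict"):
        self.stream = stream
        # The incremental decoder buffers an incomplete trailing
        # sequence internally between calls.
        self.decoder = codecs.getincrementaldecoder(encoding)(errors)

    def read(self, size=-1, final=False):
        # `size` counts bytes read from the underlying stream,
        # not characters returned.
        data = self.stream.read(size)
        # final=False: incomplete bytes are kept for the next call.
        # final=True: leftover bytes raise UnicodeDecodeError.
        return self.decoder.decode(data, final)
```

If the stream delivers only the first two bytes of a three-byte
sequence, read() returns an empty string and keeps them pending; a
later read(final=True) on an exhausted stream raises
UnicodeDecodeError instead of silently dropping them.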
>             if r have not consumed up for the output:
>                 self.pending = remainders of r
>
>             if (size == -1 or # one time read up
>                     len(outputbuffer) > 0 or # output buffer isn't empty
>                     original length of r == 0): # the end of the stream
>                 break
>
>             size = 1 # read 1 byte in next try
>
>         return outputbuffer
>
>
> CJKCodecs' multibytecodec structure has distinct internal error
> codes for "illegal sequence" and "incomplete sequence". Each
> internal codec also receives a flag that indicates whether an
> immediate flush is needed at that point (at the end of a stream,
> and in the simple decode functions).
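
The distinction between "illegal" and "incomplete" input can be
illustrated without CJKCodecs' internal error codes. The following is
only a sketch under assumptions: classify() is a hypothetical helper,
and it relies on the standard library's incremental UTF-8 decoder,
which raises at once on an illegal sequence but merely buffers an
incomplete one when final is false:

```python
import codecs


def classify(data, encoding="utf-8"):
    """Return 'ok', 'incomplete', or 'illegal' for a byte string."""
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        # final=False: a truncated trailing sequence is not an error yet.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return "illegal"
    # getstate() returns (buffered_bytes, extra_state); a non-empty
    # buffer means the input stopped in the middle of a sequence.
    pending, _ = decoder.getstate()
    return "incomplete" if pending else "ok"
```

A stream reader can keep going on "incomplete" while more input may
still arrive, and treat it as an error only once the caller signals
that the input is finished.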
Bye,
Walter Dörwald