[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Thu Aug 19 18:49:44 CEST 2004


Martin v. Löwis wrote:

> Walter Dörwald wrote:
> 
>> But then a file that contains the two bytes 0x61, 0xc3
>> will never generate an error when read via an UTF-8 reader.
>> The trailing 0xc3 will just be ignored.
>>
>> Another option we have would be to add a final() method
>> to the StreamReader, that checks if all bytes have been
>> consumed. 
> 
> Alternatively, we could add a .buffer() method that returns
> any data that are still pending (either a Unicode string or
> a byte string).

Both approaches have one problem: error handling won't
work for them. If the error handling is "replace", the decoder
should return U+FFFD for a final trailing incomplete sequence
in read(). That can't happen if read() never consumes those
bytes from the input stream.
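
For illustration, this is what the stateless decoder already
does with such input at the interactive prompt:

>>> "a\xc3".decode("utf-8", "replace")
u'a\ufffd'

A read() that silently leaves the trailing 0xc3 in an internal
buffer will never produce that U+FFFD.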

>> Maybe this should be done by StreamReader.close()?
> 
> No. There is nothing wrong with only reading a part of a file.

Yes, but if read() is called without arguments then everything
from the input stream should be read and used.

>> Now
>> inShift counts the number of characters (and the shortcut
>> for a "+-" sequence appearing together has been removed).
> 
> Ok. I didn't actually check the correctness of the individual
> methods.
> 
> OTOH, I think time spent on UTF-7 is wasted, anyway.

;) But it's a good example of how complicated state
management can get.

>> Would a version of the patch without a final argument but
>> with a feed() method be accepted?
> 
> I don't see the need for a feed method. .read() should just
> block until data are available, and that's it.

There are situations where this can never work: take a look
at xml.sax.xmlreader.IncrementalParser. This interface
has a feed() method which the user can call multiple times
to pass byte string chunks to the XML parser. These chunks
have to be decoded by the parser. If the parser wants
to use Python's StreamReader, it has to wrap the bytes passed
to feed() in a stream interface. That looks something like
the Queue class from the patch:

class Queue(object):
    def __init__(self):
        self._buffer = ""

    def write(self, chars):
        self._buffer += chars

    def read(self, size=-1):
        if size < 0:
            # return (and clear) everything buffered so far
            s = self._buffer
            self._buffer = ""
            return s
        else:
            # return at most size bytes, keep the rest buffered
            s = self._buffer[:size]
            self._buffer = self._buffer[size:]
            return s

The parser creates such an object and passes it to the
StreamReader constructor. When feed() is called on the
XML parser, the parser calls queue.write(bytes) to put
the bytes into the queue, and then calls read() on the
StreamReader (which in turn reads from the other end of
the queue) to get decoded data.

But this will fail if StreamReader.read() blocks until
more data is available. More data will never arrive,
because data only gets into the queue through explicit
calls to the XML parser's feed() method.
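
Roughly, the wiring would look like this (a sketch; feed()
here stands in for whatever the real IncrementalParser
implementation does in its feed() method):

import codecs

queue = Queue()
reader = codecs.getreader("utf-8")(queue)

def feed(data):
    # push the raw bytes into the queue ...
    queue.write(data)
    # ... and immediately ask the StreamReader for whatever it
    # can decode. If read() blocked here waiting for more bytes,
    # we would deadlock: nobody else ever writes to the queue.
    return reader.read()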

Or take a look at sio.DecodingInputFilter. This is meant
to be an alternative implementation of reading a stream
and decoding the bytes to unicode. The current implementation
is broken because it uses the stateless API. But once we
switch to the stateful API, DecodingInputFilter becomes
pointless: since it is forced to use the stream API of
StreamReader, its read() boils down to

def read(self):
    return self.stream.read()

(with stream being the stateful stream reader from
codecs.getreader()).

>> I'm imagining implementing an XML parser that uses Python's
>> unicode machinery and supports the
>> xml.sax.xmlreader.IncrementalParser interface.
> 
> I think this is out of scope of this patch. The incremental
> parser could implement a regular .read on a StringIO file
> that also supports .feed.

This adds too much infrastructure, when the alternative
implementation is trivial. Take a look at the first
version of the patch. Implementing a feed() method just
means factoring out the lines:

data = self.bytebuffer + newdata
object, decodedbytes = self.decode(data, self.errors)
self.bytebuffer = data[decodedbytes:]

into a separate method named feed():

def feed(self, newdata):
    data = self.bytebuffer + newdata
    # decode as much as possible; keep any trailing incomplete
    # byte sequence around for the next call
    object, decodedbytes = self.decode(data, self.errors)
    self.bytebuffer = data[decodedbytes:]
    return object
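
read() itself then just passes the freshly read bytes through
feed(); roughly (a simplified sketch, not the actual patch):

def read(self, size=-1):
    newdata = self.stream.read(size)
    return self.feed(newdata)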

So the feed functionality already exists, it's just
not exposed in a usable form.

Using StringIO wouldn't work because we need both
a read position and a write position.
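
StringIO has only a single file position, e.g.:

>>> from StringIO import StringIO
>>> s = StringIO()
>>> s.write("abc")
>>> s.read()   # the position is already at the end of the data
''

so one would have to juggle seek()/tell() around every read
and write to simulate two independent positions.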

>> Without the feed method(), we need the following:
>>
>> 1) A StreamQueue class that
>
> Why is that? I thought we are talking about "Decoding
> incomplete unicode"?

Well, I had to choose a subject. ;)

Bye,
    Walter Dörwald
