[Python-Dev] Decoding incomplete unicode

Wed Aug 18 23:17:31 CEST 2004

Martin v. Löwis wrote:

>> We do need to extend the API between the stream codec
>> and the encode/decode functions, no doubt about that.
>> However, this is an extension that is well hidden from
>> the user of the codec and won't break code.
> 
> So you agree to the part of Walter's change that introduces
> new C functions (PyUnicode_DecodeUTF7Stateful etc)?
> 
> I think most of the patch can be discarded: there is no
> need for .encode and .decode to take an additional argument.

But then a file that contains the two bytes 0x61, 0xc3
will never generate an error when read via a UTF-8 reader.
The trailing 0xc3 will just be ignored.
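
A minimal sketch of that situation, assuming today's
codecs.getincrementaldecoder() as a stand-in for a stateful decoder (it is
not the API of the patch under discussion). Without some way to say "this is
the end of the input", the dangling 0xc3 is simply held back:

```python
import codecs

# Incremental decoder from today's codecs module (not the patch's API).
decoder = codecs.getincrementaldecoder("utf-8")()

# 0x61 decodes to "a"; the lone lead byte 0xc3 is held back, no error yet.
text = decoder.decode(b"\x61\xc3")

# Only a call that marks the end of the input can report the truncation.
try:
    decoder.decode(b"", final=True)
    truncated = False
except UnicodeDecodeError:
    truncated = True
```

If the final call is never made, the caller only ever sees "a".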

Another option we have would be to add a final() method
to the StreamReader, that checks if all bytes have been
consumed. Maybe this should be done by StreamReader.close()?
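
One way the close() idea could look, as a rough sketch only. CheckedReader
is a hypothetical name, and it is built on today's incremental decoder
rather than on the StreamReader from the patch:

```python
import codecs
import io


class CheckedReader:
    """Hypothetical reader that checks for leftover bytes in close()."""

    def __init__(self, stream, encoding):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)()

    def read(self, size=-1):
        # Decode whatever is available; incomplete sequences stay buffered.
        return self.decoder.decode(self.stream.read(size))

    def close(self):
        # final=True makes the decoder raise if bytes are still pending.
        try:
            self.decoder.decode(b"", final=True)
        finally:
            self.stream.close()


reader = CheckedReader(io.BytesIO(b"\x61\xc3"), "utf-8")
text = reader.read()   # the trailing 0xc3 is not reported here
```

With this shape, reader.close() is the call that raises UnicodeDecodeError
for the truncated input.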

> It is only necessary that the StreamReader and StreamWriter
> are stateful, and that only for a selected subset of codecs.
> 
> Marc-Andre, if the original patch (diff.txt) was applied:
> What *specific* change in that patch would break code?
> What *specific* code (C or Python) would break under that
> change?
> 
> I believe the original patch can be applied as-is, and
> does not cause any breakage.

The first version has a broken implementation of the
UTF-7 decoder. When decoding the byte sequence "+-"
in two calls to decode() (i.e. pass "+" in one call and
"-" in the next), no character got generated, because
inShift (as a flag) couldn't remember whether characters
were encountered between the "+" and the "-". Now
inShift counts the number of characters (and the shortcut
for a "+-" sequence appearing together has been removed).
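
The property at stake can be written down as a small check: feeding "+" and
"-" in separate calls must give the same text as decoding "+-" in one go.
This sketch assumes today's incremental UTF-7 decoder in the codecs module,
not the C function from either version of the patch:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-7")()

# Feed "+" and "-" in two separate calls, as in the failing case.
pieces = [decoder.decode(b"+"), decoder.decode(b"-", final=True)]
split_result = "".join(pieces)

# Reference result: the whole sequence decoded in a single call.
one_shot = b"+-".decode("utf-7")
```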

> It also introduces a change
> between the codec and the encode/decode functions that is
> well hidden from the user of the codec.

Would a version of the patch without a final argument but
with a feed() method be accepted?

I'm imagining implementing an XML parser that uses Python's
unicode machinery and supports the
xml.sax.xmlreader.IncrementalParser interface.

With a feed() method in the stream reader this is rather simple:

init()
{
    PyObject *reader = PyCodec_StreamReader(encoding, Py_None, NULL);
    self.reader = PyObject_CallObject(reader, NULL);
}

int feed(char *bytes)
{
     parse(PyObject_CallMethod(self.reader, "feed", "s", bytes));
}

The feed method itself is rather simple (see the second
version of the patch).
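
For illustration only (this is not the code from the patch), a feed() can be
built on any decode function that reports how many bytes it consumed.
FeedReader is a hypothetical name; codecs.utf_8_decode is the function
CPython's own encodings.utf_8 module uses, called here with final=False:

```python
import codecs


class FeedReader:
    """Hypothetical reader with a feed() method instead of a stream."""

    def __init__(self, decode, errors="strict"):
        self.decode = decode    # must return (text, bytes_consumed)
        self.errors = errors
        self.pending = b""      # bytes of an incomplete sequence

    def feed(self, data):
        data = self.pending + data
        # final=False: an incomplete trailing sequence is not an error.
        text, consumed = self.decode(data, self.errors, False)
        self.pending = data[consumed:]
        return text


reader = FeedReader(codecs.utf_8_decode)
first = reader.feed(b"\x61\xc3")   # lead byte 0xc3 is kept in pending
second = reader.feed(b"\xa4")      # completes U+00E4
```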

Without the feed() method, we need the following:

1) A StreamQueue class that
    a) supports writing at one end and reading at the other end
    b) has a method for pushing back unused bytes to be returned
       in the next call to read()

2) A StreamQueueWrapper class that
    a) gets passed the StreamReader factory in the constructor,
       creates a StreamQueue instance, puts it into an attribute
       and passes it to the StreamReader factory (the resulting
       reader must also be put into an attribute).
    b) has a feed() method that calls write() on the stream queue
       and read() on the stream reader and returns the result

Then the C implementation of the parser looks something like this:

init()
{
    PyObject *module = PyImport_ImportModule("whatever");
    PyObject *wclass = PyObject_GetAttrString(module, "StreamQueueWrapper");
    PyObject *reader = PyCodec_StreamReader(encoding, Py_None, NULL);
    self.wrapper = PyObject_CallFunctionObjArgs(wclass, reader, NULL);
}

int feed(char *bytes)
{
     parse(PyObject_CallMethod(self.wrapper, "feed", "s", bytes));
}

I find this neither easier to implement nor easier to explain.

Bye,
    Walter Dörwald