[Python-Dev] Decoding incomplete unicode

"Martin v. Löwis" martin at v.loewis.de
Wed Aug 18 23:57:22 CEST 2004


Walter Dörwald wrote:
> But then a file that contains the two bytes 0x61, 0xc3
> will never generate an error when read via a UTF-8 reader.
> The trailing 0xc3 will just be ignored.
> 
> Another option we have would be to add a final() method
> to the StreamReader, that checks if all bytes have been
> consumed. 

Alternatively, we could add a .buffer() method that returns
any data that are still pending (either a Unicode string or
a byte string).
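
To make the two proposals concrete, here is a minimal sketch using
codecs.getincrementaldecoder, an API that was added to Python after this
thread (2.5 and later). It is not the patch under discussion. The
final=True flag stands in for the proposed final() check, and getstate()
stands in for the proposed .buffer() method; that mapping is an
assumption made for illustration.

```python
import codecs

# Incremental UTF-8 decoder from the standard library.
decoder = codecs.getincrementaldecoder("utf-8")()

# 0x61 is "a"; 0xc3 is the lead byte of a two-byte sequence.
text = decoder.decode(b"\x61\xc3")

# getstate() returns (undecoded_bytes, extra_state); the first item
# plays the role of the proposed .buffer() method.
pending = decoder.getstate()[0]

# Signalling end of input turns the leftover byte into an error.
try:
    decoder.decode(b"", final=True)
    truncated = False
except UnicodeDecodeError:
    truncated = True

# If the continuation byte arrives instead, decoding completes normally.
decoder.reset()
completed = decoder.decode(b"\x61\xc3") + decoder.decode(b"\xa4", final=True)
```

Without the final=True call, the trailing 0xc3 simply stays in the
decoder's buffer, which is the silent-truncation case described in the
quoted text.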

> Maybe this should be done by StreamReader.close()?

No. There is nothing wrong with only reading a part of a file.

> Now
> inShift counts the number of characters (and the shortcut
> for a "+-" sequence appearing together has been removed).

Ok. I didn't actually check the correctness of the individual
methods.

OTOH, I think time spent on UTF-7 is wasted, anyway.

> Would a version of the patch without a final argument but
> with a feed() method be accepted?

I don't see the need for a feed method. .read() should just
block until data are available, and that's it.

> I'm imagining implementing an XML parser that uses Python's
> unicode machinery and supports the
> xml.sax.xmlreader.IncrementalParser interface.

I think this is out of scope for this patch. The incremental
parser could implement a regular .read() on a StringIO-like file
object that also supports .feed().
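
One way to read this suggestion, sketched under assumptions: the class
name FeedableBytes and its feed() method are hypothetical, invented here
for illustration, and the sketch targets current Python rather than the
2004 code base. The byte queue only offers read(), so the stock UTF-8
StreamReader from codecs.getreader can sit on top of it unchanged. Unlike
the blocking .read() described above, this read() returns whatever is
queued, possibly nothing.

```python
import codecs


class FeedableBytes:
    """Hypothetical in-memory byte queue: feed() appends, read() drains."""

    def __init__(self):
        self._pending = bytearray()

    def feed(self, data):
        # Called by the producer (e.g. an incremental parser's feed()).
        self._pending += data

    def read(self, size=-1):
        # Return up to `size` queued bytes; all of them if size is negative.
        if size is None or size < 0:
            size = len(self._pending)
        chunk = bytes(self._pending[:size])
        del self._pending[:size]
        return chunk


queue = FeedableBytes()
reader = codecs.getreader("utf-8")(queue)

# The first chunk ends in the middle of a two-byte sequence.
queue.feed(b"a\xc3")
first = reader.read()

# The continuation byte arrives later; the reader kept the lead byte.
queue.feed(b"\xa4")
second = reader.read()
```

Here the first read() yields only "a" and the second yields the completed
character, so the incomplete sequence is carried across calls by the
reader itself rather than by the queue.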

> Without the feed() method, we need the following:
> 
> 1) A StreamQueue class that

Why is that? I thought we were talking about "Decoding
incomplete unicode"?

Regards,
Martin

