[Python-Dev] PEP 528: Change Windows console encoding to UTF-8

Adam Bartoš drekin at gmail.com
Sat Sep 3 05:48:58 EDT 2016


Paul Moore (p.f.moore at gmail.com) on Fri Sep 2 05:23:04 EDT 2016 wrote:

>
> On 2 September 2016 at 03:35, Steve Dower <steve.dower at python.org> wrote:
> > I'd need to test to be sure, but writing an incomplete code point should
> > just truncate to before that point. It may currently raise OSError if that
> > truncated to zero length, as I believe that's not currently distinguished
> > from an error. What behavior would you propose?
>
> For "correct" behaviour, you should retain the unwritten bytes, and
> write them as part of the next call (essentially making the API
> stateful, in the same way that incremental codecs work). I'm pretty
> sure that this could cause actual problems, for example I think invoke
> (https://github.com/pyinvoke/invoke) gets byte streams from
> subprocesses and dumps them direct to stdout in blocks (so could
> easily end up splitting multibyte sequences). It's arguable that it
> should be decoding the bytes from the subprocess and then re-encoding
> them, but that gets us into "guess the encoding used by the
> subprocess" territory.
>
> The problem is that we're not going to simply drop some bad data in
> the common case - it's not so much the dropping of the start of an
> incomplete code point that bothers me, as the encoding error you hit
> at the start of the *next* block of data you send. So people will get
> random, unexplained, encoding errors.
>
> I don't see an easy answer here other than a stateful API.
>
>
Isn't the buffered IO wrapper for this?
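
To make the "stateful" idea concrete, here is a minimal sketch using the
standard library's incremental UTF-8 decoder. The class name
Utf8ChunkWriter and the sink callable are hypothetical, chosen only for
illustration; this is not the PEP 528 implementation, just one way the
unwritten bytes of an incomplete code point could be retained between
calls.

```python
import codecs


class Utf8ChunkWriter:
    """Hypothetical helper: accepts UTF-8 bytes split at arbitrary points."""

    def __init__(self, sink):
        # sink is any callable taking a str (a stand-in for a console write).
        self._sink = sink
        # The incremental decoder keeps the bytes of an unfinished
        # code point until the next call supplies the rest.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

    def write(self, data):
        text = self._decoder.decode(data)
        if text:
            self._sink(text)
        # All bytes count as accepted; incomplete ones are held internally.
        return len(data)

    def close(self):
        # final=True raises UnicodeDecodeError if a code point is unfinished.
        self._decoder.decode(b"", final=True)


chunks = []
writer = Utf8ChunkWriter(chunks.append)
encoded = "\u0161\u20ac".encode("utf-8")  # 2 bytes followed by 3 bytes
writer.write(encoded[:1])    # first half of U+0161, nothing emitted yet
writer.write(encoded[1:3])   # completes U+0161, starts U+20AC
writer.write(encoded[3:])    # completes U+20AC
writer.close()
```

With this approach a block boundary in the middle of a multibyte sequence
produces no encoding error on the following block; the cost is that the
writer has to carry state between calls.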



> > Reads of less than four bytes fail instantly, as in the worst case we need
> > four bytes to represent one Unicode character. This is an unfortunate
> > reality of trying to limit it to one system call - you'll never get a full
> > buffer from a single read, as there is no simple mapping between
> > length-as-utf8 and length-as-utf16 for an arbitrary string.
>
> And here - "read a single byte" is a not uncommon way of getting some
> data. Once again see invoke:
> https://github.com/pyinvoke/invoke/blob/master/invoke/platform.py#L147
>
> used at
> https://github.com/pyinvoke/invoke/blob/master/invoke/runners.py#L548
>
> I'm not saying that there's an easy answer here, but this *will* break
> code. And actually, it's in violation of the documentation: see
> https://docs.python.org/3/library/io.html#io.RawIOBase.read
>
> """
> read(size=-1)
>
> Read up to size bytes from the object and return them. As a
> convenience, if size is unspecified or -1, readall() is called.
> Otherwise, only one system call is ever made. Fewer than size bytes
> may be returned if the operating system call returns fewer than size
> bytes.
>
> If 0 bytes are returned, and size was not 0, this indicates end of
> file. If the object is in non-blocking mode and no bytes are
> available, None is returned.
> """
>
> You're not allowed to return 0 bytes if the requested size was not 0,
> and you're not at EOF.
>
>

That's why it should rather be signaled by an exception. Even when one
doesn't transcode UTF-16 to UTF-8, reading just one byte is still
impossible, and I would argue that returning zero bytes is incorrect there
as well. I raise ValueError in win_unicode_console.
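
As an illustration of signaling the problem with an exception, here is a
minimal sketch of a raw reader that refuses buffers smaller than four
bytes. The class SketchConsoleReader, its MIN_READ constant and the
string-backed source are all hypothetical; this is not the code from
win_unicode_console or from the PEP, only the general shape of the idea.

```python
import io


class SketchConsoleReader(io.RawIOBase):
    """Hypothetical raw reader serving text from `source` as UTF-8 bytes."""

    MIN_READ = 4  # longest UTF-8 encoding of a single code point

    def __init__(self, source):
        super().__init__()
        self._pending = source  # stands in for text obtained from the console

    def readable(self):
        return True

    def readinto(self, buffer):
        if len(buffer) < self.MIN_READ:
            # Returning 0 here would wrongly signal EOF, so raise instead.
            raise ValueError("buffer must hold at least 4 bytes")
        written = 0
        consumed = 0
        for char in self._pending:
            encoded = char.encode("utf-8")
            if written + len(encoded) > len(buffer):
                break  # never split a code point across two reads
            buffer[written:written + len(encoded)] = encoded
            written += len(encoded)
            consumed += 1
        self._pending = self._pending[consumed:]
        return written  # 0 only when the source is exhausted


reader = SketchConsoleReader("a\u20ac\U0001F600")
try:
    reader.read(1)
except ValueError as exc:
    print("read(1) rejected:", exc)
first = reader.read(4)   # 'a' (1 byte) plus U+20AC (3 bytes)
second = reader.read(4)  # U+1F600 (4 bytes)
```

Here read() returns an empty bytes object only at a genuine end of input,
so the documented meaning of a zero-length result is preserved.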


Adam Bartoš