[issue28162] WindowsConsoleIO readall() fails if first line starts with Ctrl+Z

Wed Sep 21 01:46:33 EDT 2016

Eryk Sun added the comment:

For breaking out of the readall while loop, you only need to check if the current read is empty:

        /* when the read is empty we break */
        if (n == 0)
            break;

Also, the logic is wrong here:

    if (len == 0 || buf[0] == '\x1a' && _buflen(self) == 0) {
        /* when the result starts with ^Z we return an empty buffer */
        PyMem_Free(buf);
        return PyBytes_FromStringAndSize(NULL, 0);
    }

This is true when len is 0, or when buf[0] is Ctrl+Z and _buflen(self) is 0, because && binds more tightly than ||. Since buf[0] shouldn't ever be Ctrl+Z here (low-level EOF handling is abstracted in read_console_w), the internal buffer is effectively never checked. We can easily see this going wrong here:

    >>> a = sys.stdin.buffer.raw.read(1); b = sys.stdin.buffer.raw.read()
    Ā^Z
    >>> a
    b'\xc4'
    >>> b
    b''

It misses the remaining byte in the internal buffer.

This check can be simplified as follows:

    rn = _buflen(self);

    if (len == 0 && rn == 0) {
        /* return an empty buffer */
        PyMem_Free(buf);
        return PyBytes_FromStringAndSize(NULL, 0);
    }
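To make the difference between the two checks concrete, here is a minimal Python model of the conditions. It is only a sketch: original_check and simplified_check are hypothetical names, and the arguments stand in for len, buf[0] and _buflen(self) in the C code. Python's `or`/`and` group the same way as C's `||`/`&&`.

```python
def original_check(length: int, first_char: str, buffered: int) -> bool:
    # Models: len == 0 || buf[0] == '\x1a' && _buflen(self) == 0
    # which groups as: len == 0 || (buf[0] == '\x1a' && _buflen(self) == 0)
    return length == 0 or first_char == "\x1a" and buffered == 0


def simplified_check(length: int, buffered: int) -> bool:
    # Models: len == 0 && rn == 0, where rn = _buflen(self)
    return length == 0 and buffered == 0


# An empty console read while one byte is still pending in the internal buffer:
print(original_check(0, "", 1))   # True: returns b'' and drops the pending byte
print(simplified_check(0, 1))     # False: falls through so the byte can be returned
```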

After this the code assumes that len isn't 0, which leads to more WideCharToMultiByte failure cases. 

The last conversion overwrites bytes_size without including rn.

I'm not sure what's going on with _PyBytes_Resize(&bytes, n * sizeof(wchar_t)). ISTM it should be resized to bytes_size, making sure that this includes rn.

Finally, _copyfrombuf is repeatedly overwriting buf[0] instead of writing to buf[n]. 
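As an illustration of this kind of indexing bug, here is a small Python model. It is not the actual _copyfrombuf code: copy_buggy and copy_fixed are hypothetical helpers, and the bytearrays stand in for the internal buffer and the destination buf.

```python
def copy_buggy(internal: bytearray, dest: bytearray) -> int:
    n = 0
    while n < len(dest) and internal:
        dest[0] = internal.pop(0)  # bug: every byte lands in slot 0
        n += 1
    return n


def copy_fixed(internal: bytearray, dest: bytearray) -> int:
    n = 0
    while n < len(dest) and internal:
        dest[n] = internal.pop(0)  # each byte goes to the next free slot
        n += 1
    return n


buggy = bytearray(2)
copy_buggy(bytearray(b"\x88\xb4"), buggy)
print(bytes(buggy))   # b'\xb4\x00'

fixed = bytearray(2)
copy_fixed(bytearray(b"\x88\xb4"), fixed)
print(bytes(fixed))   # b'\x88\xb4'
```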

With the attached patch, the behavior seems correct now:

    >>> sys.stdin.buffer.raw.read()
    ^Z
    b''

    >>> sys.stdin.buffer.raw.read()
    abc^Z
    ^Z
    b'abc\x1a\r\n'

Split U+0100:

    >>> a = sys.stdin.buffer.raw.read(1); b = sys.stdin.buffer.raw.read()
    Ā^Z
    >>> a
    b'\xc4'
    >>> b
    b'\x80'

Split U+1234:

    >>> a = sys.stdin.buffer.raw.read(1); b = sys.stdin.buffer.raw.read()
    ሴ^Z
    >>> a
    b'\xe1'
    >>> b
    b'\x88\xb4'
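The two split results above match the UTF-8 encodings of those characters, which can be checked without a console. A minimal sketch (split_first_byte is a hypothetical helper that mimics read(1) followed by read()):

```python
def split_first_byte(text: str) -> tuple[bytes, bytes]:
    # Encode to UTF-8, then split off the first byte, as read(1) + read() would.
    encoded = text.encode("utf-8")
    return encoded[:1], encoded[1:]


print(split_first_byte("\u0100"))   # (b'\xc4', b'\x80')
print(split_first_byte("\u1234"))   # (b'\xe1', b'\x88\xb4')
```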

The buffer still can't handle splitting an initial non-BMP character, which is stored as a surrogate pair. Both code units end up as replacement characters because they aren't transcoded as a unit.

Split U+00010000:

    >>> a = sys.stdin.buffer.raw.read(1); b = sys.stdin.buffer.raw.read()
    𐀀^Z
    ^Z
    >>> a
    b'\xef'
    >>> b
    b'\xbf\xbd\xef\xbf\xbd\x1a\r\n'
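The bytes in this last result are two UTF-8 encoded U+FFFD replacement characters. The following Python sketch models that outcome under the assumption that a lone surrogate, transcoded on its own, is replaced by U+FFFD. It does not call WideCharToMultiByte, and to_surrogates and transcode_unit are hypothetical helpers.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    # Split a non-BMP code point into its UTF-16 high and low surrogates.
    v = code_point - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def transcode_unit(unit: int) -> bytes:
    # Assumption: a surrogate converted by itself becomes U+FFFD.
    if 0xD800 <= unit <= 0xDFFF:
        return "\ufffd".encode("utf-8")
    return chr(unit).encode("utf-8")


high, low = to_surrogates(0x10000)
separate = transcode_unit(high) + transcode_unit(low)
print(separate[:1], separate[1:])   # b'\xef' b'\xbf\xbd\xef\xbf\xbd'

# Transcoding the pair as a unit gives the expected four-byte sequence instead.
print("\U00010000".encode("utf-8"))  # b'\xf0\x90\x80\x80'
```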

----------
keywords: +patch
status: closed -> open
Added file: http://bugs.python.org/file44766/issue_28162_01.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue28162>
_______________________________________