[IPython-dev] ASCII Terminal IPython re-encodes bytes greater than 127

Sat Jul 26 14:49:08 EDT 2014

On 26 July 2014 11:31, Thomas Ballinger <tom at hackerschool.com> wrote:

> If I understand correctly, IPython is something like
>
> repr(eval(raw_input('>>> ').decode(sys.stdin.encoding, 'replace')))
>

Yes, that's more or less correct in Python 2. In Python 3, input() already
returns unicode, so the explicit decode step goes away.

> and therefore b'þ' in an ascii encoded terminal will end up being the
> unicode replacement character \ufffd because it can't be encoded in ascii,
> the reported encoding. When the code is evaluated, if it's not in a string
> literal it will be a syntax error (though in an ascii terminal this
> traceback can't be written to stdout). If it appears in a unicode literal,
> it's \ufffd, and if it's in a bytestring literal, it's \xef\xbf\xbd, the
> UTF-8 encoding of that character.
>

If the terminal is really ascii encoded, b'þ' is not even possible in the
first place. If the terminal claims incorrectly to be ascii encoded, then
it's not clear what bytes IPython sees when you type the character þ. The
most likely candidates would be the single byte FE if it's really latin1 or
cp1252, or the two bytes C3 BE if it's really UTF-8. So when IPython tries
to decode it, it will become one or two \ufffd characters.
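
To make that concrete, here is a minimal sketch in Python 3 (not IPython's
actual code) of what happens when those candidate byte sequences are decoded
as ASCII with errors='replace', and what U+FFFD looks like once it is
re-encoded as UTF-8. The variable names are only for illustration:

```python
# Candidate byte sequences a mislabelled terminal might send for U+00FE (thorn).
latin1_bytes = "\u00fe".encode("latin-1")  # b'\xfe'
utf8_bytes = "\u00fe".encode("utf-8")      # b'\xc3\xbe'

# Decode with the (wrongly) reported encoding, replacing undecodable bytes.
from_latin1 = latin1_bytes.decode("ascii", "replace")
from_utf8 = utf8_bytes.decode("ascii", "replace")

# Each undecodable byte becomes one U+FFFD REPLACEMENT CHARACTER.
print(ascii(from_latin1))  # '\ufffd'
print(ascii(from_utf8))    # '\ufffd\ufffd'

# Re-encoding U+FFFD as UTF-8 yields the three bytes EF BF BD.
print(from_latin1.encode("utf-8"))  # b'\xef\xbf\xbd'
```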

> This is simpler than the behavior I guessed was happening because I didn't
> look up what \ufffd was (
> http://en.wikipedia.org/wiki/Specials_(Unicode_block) - I wrongly assumed
> ipython was decoding this byte with latin-1 and then re-encoding it with
> utf8).
>
> If one was in a position to reject keys on a byte-by-byte basis (as
> bpython is) might it make sense to simply reject these bytes? If they come
> from the keyboard, they're funny meta key presses (you pressed meta-a; it
> doesn't do anything) and if they come from a paste event, the terminal
> emulator is doing a terrible job encoding into the reported encoding.
> However, a few missing bytes would be more confusing than a few
> characters being replaced with \ufffd.
>
> I think I want to ignore these bytes individually, but replace them with
> \ufffd when they happen in paste events, but I'd love to hear comments on
> this (can take them off this list if they're off topic). Thanks very much
> for input (and for IPython, which is obviously awesome).
>

What system has a terminal that claims to be ASCII but isn't? In my
experience, most terminals on recent systems report either that they are
UTF-8, or one of the Windows code pages.

If the terminal does actually claim to be ASCII when it isn't, I'd consider
that a bug in the terminal, and probably wouldn't feel bad about rejecting
non-ascii keypresses.
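
As a rough sketch of the policy you describe, assuming you receive raw bytes
plus a flag saying whether they came from a paste: decode_ascii_input and
is_paste are hypothetical names for illustration, not part of bpython or
IPython.

```python
def decode_ascii_input(data: bytes, is_paste: bool) -> str:
    """Decode terminal input that is supposed to be ASCII.

    Hypothetical helper: bytes above 127 are silently dropped for
    keypresses, but replaced with U+FFFD in pasted text so that the
    damage stays visible to the user.
    """
    errors = "replace" if is_paste else "ignore"
    return data.decode("ascii", errors)


print(ascii(decode_ascii_input(b"a\xfeb", is_paste=False)))  # 'ab'
print(ascii(decode_ascii_input(b"a\xfeb", is_paste=True)))   # 'a\ufffdb'
```

Both branches lean on the standard codec error handlers, so no per-byte loop
is needed.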

If you get paste events as a separate thing, you may be able to retrieve a
unicode string from the clipboard, and avoid going via the terminal's
encoding.

Thomas