Python under PowerShell adds characters

Marko Rauhamaa marko at pacujo.net
Thu Mar 30 01:57:00 EDT 2017


Chris Angelico <rosuav at gmail.com>:

> On Thu, Mar 30, 2017 at 4:43 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> The input is not in my control, and bailing out may not be an option:
>>
>>    $ echo $'aa\n\xdd\naa' | grep aa
>>    aa
>>    aa
>>    $ echo $'\xdd' | python2 -c 'import sys; sys.stdin.read(1)'
>>    $ echo $'\xdd' | python3 -c 'import sys; sys.stdin.read(1)'
>>    Traceback (most recent call last):
>>      File "<string>", line 1, in <module>
>>      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
>>        (result, consumed) = self._buffer_decode(data, self.errors, final)
>>    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 0:
>>     invalid continuation byte
>>
>> Note that "grep" is also locale-aware.
>
> So what exactly does byte value 0xDD mean in your stream?
>
> And if you say "it doesn't matter", then why are you assigning meaning
> to byte value 0x0A in your first example? Truly binary data doesn't
> give any meaning to 0x0A.

What I'm saying is that every program must behave in a minimally
controlled manner regardless of its inputs (which are not in its
control). With UTF-8, it is dangerously easy to write programs that
explode surprisingly. What's more, resyncing after such exceptions is
not at all easy. I would venture to guess that few Python programs even
try to do that.
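
For illustration, here is a minimal sketch of one way a Python 3 program
could tolerate a stray byte such as 0xDD instead of raising. It assumes
the input is meant to be UTF-8 and uses the standard "surrogateescape"
error handler; the function names are hypothetical, made up for this
example, and this is only one possible approach.

```python
def decode_tolerantly(data: bytes) -> str:
    # Undecodable bytes become lone surrogates (0xDD -> U+DCDD)
    # instead of raising UnicodeDecodeError.
    return data.decode("utf-8", errors="surrogateescape")


def encode_back(text: str) -> bytes:
    # The same handler restores the original bytes on output,
    # so the bad byte passes through unchanged.
    return text.encode("utf-8", errors="surrogateescape")


def matching_lines(data: bytes, needle: str) -> list:
    # Rough analogue of "grep needle": keep lines containing needle,
    # even when other lines hold invalid UTF-8.
    text = decode_tolerantly(data)
    return [line for line in text.splitlines() if needle in line]


if __name__ == "__main__":
    sample = b"aa\n\xdd\naa\n"
    print(matching_lines(sample, "aa"))
    print(encode_back(decode_tolerantly(sample)) == sample)
```

The same handler can presumably be applied to standard input by wrapping
sys.stdin.buffer in io.TextIOWrapper(..., errors="surrogateescape"), so
that a read never raises on a bad byte.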


Marko


