Python under PowerShell adds characters

Marko Rauhamaa marko at pacujo.net
Thu Mar 30 01:43:46 EDT 2017


Steven D'Aprano <steve at pearwood.info>:

> On Thu, 30 Mar 2017 07:29:48 +0300, Marko Rauhamaa wrote:
>> I'd expect not having to deal with Unicode decoding exceptions with
>> arbitrary input.
>
> That's just silly. If you have *arbitrary* bytes, not all
> byte-sequences are valid Unicode, so you have to expect decoding
> exceptions, if you're processing text.

The input is not under my control, and bailing out may not be an option:

   $ echo $'aa\n\xdd\naa' | grep aa
   aa
   aa
   $ echo $'\xdd' | python2 -c 'import sys; sys.stdin.read(1)'
   $ echo $'\xdd' | python3 -c 'import sys; sys.stdin.read(1)'
   Traceback (most recent call last):
     File "<string>", line 1, in <module>
     File "/usr/lib64/python3.5/codecs.py", line 321, in decode
       (result, consumed) = self._buffer_decode(data, self.errors, final)
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 0:
    invalid continuation byte

Note that "grep" is also locale-aware, yet it gets past the stray 0xdd
byte in the example above without complaint.
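
For comparison, here is a minimal sketch of how Python 3 can be told not
to raise on such input, assuming the "surrogateescape" error handler
(PEP 383) is acceptable for the job at hand. The function name
grep_bytes and the sample data are made up for illustration; only
bytes.decode() and its errors= argument come from the standard library.

```python
def grep_bytes(raw, needle):
    """Return the lines of raw that contain needle, tolerating invalid UTF-8.

    grep_bytes is a hypothetical helper, not a standard-library function.
    """
    # surrogateescape maps each undecodable byte to a lone surrogate
    # (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="surrogateescape")
    return [line for line in text.splitlines() if needle in line]


# Same bytes as in the shell example: "aa", a stray 0xdd, "aa".
sample = b"aa\n\xdd\naa\n"
matches = grep_bytes(sample, "aa")
print(matches)
```

The same handler can be applied to the standard streams, for instance by
wrapping sys.stdin.buffer in io.TextIOWrapper(..., errors="surrogateescape")
or by setting PYTHONIOENCODING=utf-8:surrogateescape, but that is opt-in
rather than the default behavior shown in the traceback.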

>> There recently was a related debate on the Guile mailing list. Like
>> Python3, Guile2 is sensitive to illegal UTF-8 on the command line and
>> in the standard streams. An emacs developer was urging Guile
>> developers to follow emacs's example and support a superset of UTF-8
>> and Unicode where all byte strings can be bijectively mapped into
>> text.
>
> I'd like to read that. Got a link?

<URL: http://lists.gnu.org/archive/html/guile-user/2017-02/msg00054.html>
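
The kind of lossless bytes-to-text mapping discussed there can be
approximated in Python 3 with "surrogateescape". The sketch below is
only an illustration of the round-trip property, not a description of
how emacs or Guile implement it; to_text and to_bytes are hypothetical
helper names.

```python
def to_text(raw):
    # Undecodable bytes become lone surrogates U+DC80..U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text):
    # Lone surrogates produced by to_text are turned back into the
    # original bytes; everything else is encoded as ordinary UTF-8.
    return text.encode("utf-8", errors="surrogateescape")


# Every single byte value survives the round trip...
round_trip_ok = all(
    to_bytes(to_text(bytes([b]))) == bytes([b]) for b in range(256)
)

# ...and so does a mix of valid UTF-8 and garbage.
mixed = b"caf\xc3\xa9 \xdd\xff tail"
mixed_ok = to_bytes(to_text(mixed)) == mixed

print(round_trip_ok, mixed_ok)
```

Note that the guarantee only runs one way: every byte string maps to a
distinct str and back, but a str containing other lone surrogates still
cannot be encoded as strict UTF-8.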


Marko


