UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)

Wed Jan 14 05:24:52 EST 2015

On 01/13/2015 10:26 PM, Peng Yu wrote:
> Hi,
>

First, you should always specify your Python version and OS version when 
asking questions here.  Even if you've mentioned them in earlier threads, 
many of us cannot keep track of everyone's specifics, and need a standard 
place to look: the head of the current thread.

I'll assume you're using Python 2.7, on Linux or equivalent.

> I am trying to understand what does encode() do. What are the hex
> representations of "u" in main.py? Why there is UnicodeEncodeError
> when main.py is piped to xxd? Why there is no such error when it is
> not piped? Thanks.
>
> ~$ cat main.py
> #!/usr/bin/env python
>
> u = unichr(40960) + u'abcd' + unichr(1972)
> print u

The unicode characters in 'u' must be encoded to a byte stream before 
being sent to the standard out device.  How they're encoded depends on 
the device, and what Python knows (or thinks it knows) about it.

> ~$ cat main_encode.py
> #!/usr/bin/env python
>
> u = unichr(40960) + u'abcd' + unichr(1972)
> print u.encode('utf-8')

Here, you've already done the encoding yourself, so print is sending 
bytes to a byte-device, and doesn't try to second-guess anything.
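
To answer the "hex representations" part directly, here is a minimal 
sketch of what encode() produces for this string.  The names encoded and 
hex_bytes are just mine for illustration, and I've written the string 
with escapes so the same lines run on 2.7 and on 3.3 or later.

```python
import binascii

# Same text as in main.py: U+A000, then 'abcd', then U+07B4.
u = u'\ua000abcd\u07b4'

# encode() turns the code points into bytes under the named encoding.
encoded = u.encode('utf-8')

# Hex of those bytes, for comparison with an xxd dump.
hex_bytes = binascii.hexlify(encoded)
print(hex_bytes)
```

Those are the same bytes xxd reports for main_encode.py; the extra 0a at 
the end of the xxd line is the newline that print adds.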

> $ ./main.py
> ꀀabcd޴
> ~$ cat main.sh
> #!/usr/bin/env bash
>
> set -v
> ./main.py | xxd
> ./main_encode.py | xxd
>
> ~$ ./main.sh
> ./main.py | xxd
> Traceback (most recent call last):
>    File "./main.py", line 4, in <module>
>      print u
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in
> position 0: ordinal not in range(128)
> ./main_encode.py | xxd
> 0000000: ea80 8061 6263 64de b40a                 ...abcd...
>

I'm guessing (since I already guessed you're running on Linux) that when 
you run ./main.py directly, you're printing to a terminal window that 
Python already knows is utf-8.

But in the pipe case, it cannot tell what's on the other side.  So it 
guesses ASCII, and runs into the conversion problem.
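
You can check what the interpreter thinks it knows, and reproduce the 
failure by hand.  This is only a sketch: the variable names are mine, 
and as far as I know 2.7 reports None for the encoding when stdout is a 
pipe, though I haven't checked every platform.

```python
import sys

u = u'\ua000abcd\u07b4'

# What the interpreter believes about stdout.  Run this both directly
# and piped through another program to compare the two cases.
stdout_encoding = getattr(sys.stdout, 'encoding', None)
print('stdout encoding: %r' % (stdout_encoding,))

# Encoding with ASCII by hand fails on the very first character,
# which is the same complaint the traceback makes.
try:
    u.encode('ascii')
    error_position = None
except UnicodeEncodeError as exc:
    error_position = exc.start
    error_reason = exc.reason
```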

(Everything's different in Python 3.x, though in general the problem 
still exists.  If the interpreter cannot tell what encoding is needed, 
it has to guess.)

There are ways to tell Python 2.7 what encoding a given file object 
should have, so you could tell Python to use utf-8 for sys.stdout.  I 
don't know if that's the best answer, but here's what my notes say:

     import sys, codecs
     sys.stdout = codecs.getwriter('utf8')(sys.stdout)

Once you've done that, print output will go through the specified codec 
on the way to the redirected pipe.
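
If you want to see what that wrapper does without touching the real 
sys.stdout, here is a small sketch that wraps an in-memory byte buffer 
instead.  The names utf8_writer, buf and out are hypothetical, made up 
for the example; codecs.getwriter is the same call as above.

```python
import codecs
import io

def utf8_writer(byte_stream):
    # Return a writer that accepts unicode text and hands UTF-8 bytes
    # to the underlying byte stream.
    return codecs.getwriter('utf-8')(byte_stream)

buf = io.BytesIO()           # stands in for a pipe or other byte device
out = utf8_writer(buf)
out.write(u'\ua000abcd\u07b4\n')

data = buf.getvalue()        # the bytes that reached the "device"
```

The bytes in data are the ones xxd showed for main_encode.py, which is 
the point: the codec does the encoding, so print no longer has to guess.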

-- 
DaveA