[Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position

Sun Mar 11 10:38:09 CET 2012

On Sat, Mar 10, 2012 at 08:03:18PM -0500, Dave Angel wrote:

> There are just 256 possible characters in cp1252, and 256 in cp932.

CP932 is also known as MS-KANJI or SHIFT-JIS (actually, one of many 
variants of SHIFT-JIS). It is a multi-byte encoding, which means it has 
far more than 256 characters.

http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
http://en.wikipedia.org/wiki/Shift_JIS
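
To see the multi-byte nature for yourself, here is a minimal sketch. The 
sample characters are my own choice for illustration, not anything from 
the OP's data.

```python
# Illustrative only: "a" and "日" are sample characters picked for this sketch.
ascii_bytes = "a".encode("cp932")    # ASCII letters take a single byte
kanji_bytes = "日".encode("cp932")   # a kanji takes two bytes

print(len(ascii_bytes), len(kanji_bytes))
```

Since some characters take two bytes, the encoding is not limited to 256 
code points.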

The actual problem the OP has got is that the *multi-byte* sequence he 
is trying to print is illegal when interpreted as CP932. Personally I 
think that's a bug in the terminal, or possibly even print, since he's 
not printing bytes but characters, but I haven't given that a lot of 
thought so I might be way out of line.

The quick and dirty fix is to change the encoding of his terminal, so 
that it no longer tries to interpret the characters printed using CP932. 
That will also mean he'll no longer see valid Japanese characters.

But since he appears to be using Windows, I don't know if this is 
possible, or easy.
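
If changing the terminal turns out to be awkward, another option might be 
to work around it from the Python side: wrap the output stream so that 
characters the encoding can't represent are replaced rather than raising. 
This is only a sketch. The helper name make_tolerant_writer is 
hypothetical, and I use an in-memory buffer in place of the real 
terminal so the result is easy to inspect.

```python
import io

def make_tolerant_writer(binary_stream, encoding="cp932"):
    """Hypothetical helper: text wrapper that substitutes '?' for
    characters the target encoding cannot represent."""
    return io.TextIOWrapper(
        binary_stream,
        encoding=encoding,
        errors="replace",   # replace instead of raising UnicodeEncodeError
        newline="\n",       # no platform-specific newline translation
    )

# Stand-in for the terminal: an in-memory byte buffer.
buffer = io.BytesIO()
writer = make_tolerant_writer(buffer)
print("caf\xe9", file=writer)   # '\xe9' is the character from the subject line
writer.flush()
output_bytes = buffer.getvalue()
print(output_bytes)
```

In a real script one would presumably pass sys.stdout.buffer instead of 
the BytesIO object. The cost is the same as with any "replace" strategy: 
the unsupported characters are lost in the output.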

[...] 
> You can "solve" the problem by pretending the input file is also cp932 
> when you open it. That way you'll get the wrong characters, but no 
> errors.

Not so -- there are multi-byte sequences that can't be read in CP932.

>>> b"\xe9x".decode("cp932")  # this one works
'騙'
>>> b"\xe9!".decode("cp932")  # this one doesn't
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 0-1: illegal multibyte sequence

In any case, the error doesn't occur when he reads the data, but when he 
prints it. Once the data is read, it is already Unicode text, so he 
should be able to print any character. At worst, it will print as a 
missing character (a square box or space) rather than the expected 
glyph. He shouldn't get a UnicodeDecodeError when printing. I smell a 
bug since print shouldn't be decoding anything. (At worst, it needs to 
*encode*.)
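
For what it's worth, the encoding step can be tried in isolation, without 
print or a terminal getting involved. A minimal sketch, assuming the text 
contains the '\xe9' character named in the subject line (the surrounding 
"caf" is made up for the example):

```python
# "caf\xe9" is a made-up sample; only '\xe9' comes from the subject line.
text = "caf\xe9"

try:
    encoded = text.encode("cp932")   # strict error handling is the default
except UnicodeEncodeError as err:
    encoded = None
    print("encode failed:", err.reason)
```

If the strict encode fails here, then the failure belongs to the encoding 
step, not to reading the data.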

-- 
Steven