print u"\u0432": why is this so hard? UnicodeEncodeError

Wed Apr 7 22:02:07 EDT 2004

I have a simple goal. I want the following Python program to work:
  print u"\u0432"

This program fails on my US Debian machine:
  UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)

Actually, I have a complex goal: I want my SOAPpy program to work when
SOAPpy is in debug mode and is printing XML messages out to stdout.
Solving the simple problem will solve the complex one. Since I'm using
third-party code, I can't go and modify every print statement to call
encode() explicitly.

The simplest solution I've come up with is this:
  $ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"'

That seems to work reasonably well in Python 2.3 (but not 2.2!). But
if I redirect stdout in my shell, the same command fails with the same
UnicodeEncodeError:
  $ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"' > /dev/null

Why is that?
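
One way to look into the difference is to ask Python what it thinks
stdout is in each case. This is a minimal diagnostic sketch, assuming
file objects expose an "encoding" attribute (they do as of 2.3, as far
as I know). describe_stream is a name made up for illustration. It
reports on stderr so the message survives redirecting stdout.

```python
import sys

def describe_stream(stream):
    """Return (is_tty, encoding) for a file-like object.

    Either value falls back to a safe default when the stream
    does not provide isatty() or an encoding attribute.
    """
    is_tty = hasattr(stream, "isatty") and stream.isatty()
    encoding = getattr(stream, "encoding", None)
    return (is_tty, encoding)

if __name__ == "__main__":
    # Write to stderr so the report is visible even with "> /dev/null".
    sys.stderr.write("stdout: tty=%r encoding=%r\n" % describe_stream(sys.stdout))
    sys.stderr.write("default encoding: %r\n" % sys.getdefaultencoding())
```

Running this once with stdout on the terminal and once with
"> /dev/null" lets you compare the two reports. If the encoding is only
filled in when stdout is a terminal, that would account for the
fallback to 'ascii' in the redirected case.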

The only solution I've found that really works is reassigning
sys.stdout at the top of the script. That's an awful lot of work, but
it's the best I can do for now.
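
For reference, the reassignment can be packaged as a small helper so
that it is a single call at the top of the script. This is only a
sketch of the workaround above. utf8_writer and install_utf8_stdout are
hypothetical names, and UTF-8 is hard-coded on the assumption that
that's what the terminal or the consumer of the output expects.

```python
import sys
import codecs

def utf8_writer(byte_stream):
    """Wrap a byte-oriented stream so unicode written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(byte_stream)

def install_utf8_stdout():
    """Replace sys.stdout with a UTF-8 encoding writer (assumed target encoding)."""
    # Where stdout is a text layer over a byte buffer, wrap that buffer;
    # otherwise the original file object accepts byte strings directly.
    raw = getattr(sys.__stdout__, "buffer", sys.__stdout__)
    sys.stdout = utf8_writer(raw)
```

Calling install_utf8_stdout() before any third-party code runs means
its print statements go through the writer too, with or without
redirection, so none of them need to be touched.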

Why is Python not respecting my locale?

Here's my test program:

----------------------------------------------------------------------

#!/bin/bash -x

# Obliterate locale
for e in LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL; do
  unset $e
done

# Doing the obvious thing has nonobvious effects
python2.3 -c 'print u"\u0432"'                               # fails, OK.
LC_ALL=en_US.utf8 python2.3 -c 'print u"\u0432"'             # works!
LC_ALL=en_US.utf8 python2.3 -c 'print u"\u0432"' > /dev/null # fails, huh?

# These both work, but what a pain!
python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"'
python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"' > /dev/null

----------------------------------------------------------------------

And sample output:

----------------------------------------------------------------------

~/src/python/testUnicode.sh
+ unset LANG
+ unset LC_CTYPE
+ unset LC_NUMERIC
+ unset LC_TIME
+ unset LC_COLLATE
+ unset LC_MONETARY
+ unset LC_MESSAGES
+ unset LC_PAPER
+ unset LC_NAME
+ unset LC_ADDRESS
+ unset LC_TELEPHONE
+ unset LC_MEASUREMENT
+ unset LC_IDENTIFICATION
+ unset LC_ALL
+ python2.3 -c 'print u"\u0432"'
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)
+ LC_ALL=en_US.utf8
+ python2.3 -c 'print u"\u0432"'
в
+ LC_ALL=en_US.utf8
+ python2.3 -c 'print u"\u0432"'
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)
+ python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"'
в
+ python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"'