Yet another unicode WTF

Ned Deily nad at acm.org
Fri Jun 5 03:03:30 EDT 2009


In article <8763fbmk5a.fsf at benfinney.id.au>,
 Ben Finney <ben+python at benfinney.id.au> wrote:
> Ned Deily <nad at acm.org> writes:
> > $ python2.6 -c 'import sys; print sys.stdout.encoding, \
> >  sys.stdout.isatty()'
> > UTF-8 True
> > $ python2.6 -c 'import sys; print sys.stdout.encoding, \
> >  sys.stdout.isatty()' > foo ; cat foo
> > None False
> 
> So shouldn't the second case also detect UTF-8? The filesystem knows
> it's UTF-8, the shell knows it too. Why doesn't Python know it?

The filesystem knows what is UTF-8?  While the setting of the locale 
environment variables may influence how the file system interprets the 
*name* of a file, it has no direct influence on what the *contents* of a 
file are or are supposed to be.  Remember, in python 2.x a file is just 
a sequence of bytes.  If you want to write encoded Unicode to the file, 
you need to use something like codecs.open to wrap the file object with 
the proper StreamWriter encoder.
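
For example, a minimal 2.x sketch (the file name here is just a 
placeholder):

>>> import codecs
>>> f = codecs.open('foo.txt', 'w', encoding='utf-8')
>>> f.write(u'\u0430\u0431\u0432')   # StreamWriter encodes to UTF-8 bytes
>>> f.close()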

What confuses matters in 2.x is the implicit, under-the-covers Unicode 
encoding that the print statement does for files connected to a terminal:

http://bugs.python.org/issue612627
http://bugs.python.org/issue4947
http://wiki.python.org/moin/PrintFails

>>> import sys
>>> x = u'\u0430\u0431\u0432'
>>> print x
[nice looking characters here]
>>> sys.stdout.write(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> sys.stdout.encoding
'UTF-8'
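
(A sketch of one common 2.x workaround, assuming the terminal really is 
UTF-8: wrap sys.stdout in an explicit StreamWriter instead of relying on 
print's guess.)

>>> import codecs
>>> out = codecs.getwriter('utf-8')(sys.stdout)
>>> out.write(x + u'\n')
[nice looking characters here]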

In python 3.x, of course, the encoding happens automatically, but you 
still have to tell python, via the "encoding" argument to open, what the 
encoding of the file's content is (or accept python's locale-based 
default, which may not be very useful):

>>> open('foo1','w').encoding
'mac-roman'
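
For example (again just a sketch, with a placeholder file name), passing 
the encoding explicitly avoids depending on that locale-based default:

>>> f = open('foo2', 'w', encoding='utf-8')
>>> f.encoding
'utf-8'
>>> f.write('\u0430\u0431\u0432')
3
>>> f.close()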

WTF, indeed.

-- 
 Ned Deily,
 nad at acm.org